excel for indexing the actual content

Jack Krupansky Sun, 27 Jan 2013 10:18:00 -0800

You may be able to use Tika directly without needing to choose the specific classes, although the latter may give you the specific data you need without the extra overhead.

You could take a look at the Solr Extracting Request Handler source for an example:

http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/

Basically, Tika extracts a bunch of "metadata" and then you will have to add selected metadata to your Lucene documents. "content" is the main document body text.
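
To make the "add selected metadata" step concrete, here is a minimal, dependency-free sketch. It is not code from Tika, Lucene, or Solr: plain maps stand in for Tika's Metadata object (whose entries you would read with Metadata.names() and Metadata.get(name)) and for the Lucene Document you would fill with Document.add(...). The class name, method name, and sample keys are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch: maps play the role of Tika's Metadata (input)
// and of a Lucene Document (output). All names here are hypothetical.
class MetadataSelection {

    // Copy only the wanted metadata keys, then the body text, into a new "document".
    static Map<String, String> buildDocument(Map<String, String> extracted,
                                             String bodyText,
                                             List<String> wantedKeys) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (String key : wantedKeys) {
            String value = extracted.get(key);
            if (value != null) {          // skip keys the parser did not produce
                doc.put(key, value);
            }
        }
        // Written last so a metadata key named "content" cannot replace the body.
        doc.put("content", bodyText);
        return doc;
    }

    public static void main(String[] args) {
        // Pretend these values came out of a parser run on one file.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("title", "Quarterly report");
        extracted.put("Content-Type", "application/pdf");
        extracted.put("producer", "SomePdfTool");

        Map<String, String> doc = buildDocument(
                extracted, "Revenue grew ...", Arrays.asList("title", "author"));
        System.out.println(doc);
    }
}
```

In a real indexer each selected entry would become a Lucene field instead of a map entry, and whether a given field is stored, tokenized, or both is a per-field decision this sketch leaves out.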


You could try Solr itself to see how it works:
http://wiki.apache.org/solr/ExtractingRequestHandler

-- Jack Krupansky

-----Original Message-----
From: Adrien Grand

Sent: Sunday, January 27, 2013 12:53 PM
To: java-user@lucene.apache.org

Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content


Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?

This question seems to be more appropriate for the Tika user mailing list [3]?

[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

For additional commands, e-mail: java-user-h...@lucene.apache.org


