You may be able to use Tika directly and let it pick the parser, rather than choosing the specific parser classes yourself, although calling the specific classes may give you just the data you need without the extra overhead.

You could take a look at the Solr Extracting Request Handler source for an example:
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/

Basically, Tika extracts a set of "metadata" values from each file, and you then have to add the ones you want to your Lucene documents yourself. "content" is the main document body text.
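
A rough sketch of that selection step, using only plain Java collections so it runs on its own. The class name, method name, and sample values are hypothetical. In real code the input map would be filled from Tika's Metadata object (for example by looping over metadata.names() and calling metadata.get(name)), and each entry of the result would become a field on a Lucene Document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SelectedFields {

    /**
     * Builds the field map for one index document: the body text under
     * "content", plus only those metadata entries named in wantedNames.
     * Metadata names that are missing or empty are skipped.
     */
    public static Map<String, String> select(Map<String, String> extracted,
                                             String bodyText,
                                             List<String> wantedNames) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("content", bodyText);
        for (String name : wantedNames) {
            String value = extracted.get(name);
            if (value != null && !value.isEmpty()) {
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for what a parser might report.
        Map<String, String> extracted = new LinkedHashMap<String, String>();
        extracted.put("title", "Quarterly report");
        extracted.put("Content-Type", "application/pdf");
        extracted.put("producer", "SomeTool");

        Map<String, String> fields = select(
                extracted, "Body text of the file.", Arrays.asList("title", "Author"));

        // Only "content" and "title" survive; "Author" was not extracted.
        for (Map.Entry<String, String> e : fields.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Which metadata names you keep depends on the file types you index, since different parsers report different names.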

You could try Solr itself to see how it works:
http://wiki.apache.org/solr/ExtractingRequestHandler

-- Jack Krupansky

-----Original Message----- From: Adrien Grand
Sent: Sunday, January 27, 2013 12:53 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?

This question seems more appropriate for the Tika user mailing list [3].

[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
