You may be able to use Tika directly and let it detect the file type and
pick a parser for you, rather than choosing the specific parser classes
yourself. Calling the specific classes may still be worthwhile if it gives
you the data you need without the extra overhead.
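
As a rough illustration of the "use Tika directly" route, here is a minimal
sketch that assumes tika-core and tika-parsers are on the classpath. The class
name TikaExtractSketch and the Extracted holder are made up for this example;
AutoDetectParser, BodyContentHandler, Metadata and ParseContext are the Tika
classes doing the work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika or Lucene.
class TikaExtractSketch {

    /** Hypothetical holder for the body text plus the metadata Tika reported. */
    static final class Extracted {
        final String content;
        final Map<String, String> metadata;

        Extracted(String content, Map<String, String> metadata) {
            this.content = content;
            this.metadata = metadata;
        }
    }

    static Extracted extract(InputStream in)
            throws IOException, SAXException, TikaException {
        // AutoDetectParser sniffs the type (PDF, Word, Excel, ...) and delegates
        // to the matching parser, so no specific parser class is named here.
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the handler's default limit on the amount of text collected.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());

        // Copy the first value of every metadata key into a plain map.
        Map<String, String> values = new LinkedHashMap<String, String>();
        for (String name : metadata.names()) {
            values.put(name, metadata.get(name));
        }
        return new Extracted(handler.toString(), values);
    }
}
```

If you already know every file is a PDF, swapping AutoDetectParser for
PDFParser in the sketch above skips the type detection step.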
You could take a look at the Solr Extracting Request Handler source for an
example:
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/
Basically, Tika extracts the body text plus a set of "metadata" values, and
you then have to add the metadata you care about to your Lucene documents as
fields. "content" is the main document body text.
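
A minimal sketch of that last step, assuming Lucene 4.x and tika-core on the
classpath. The class LuceneDocSketch, the method toDocument and the choice of
which keys to keep are hypothetical; the metadata key names Tika actually
reports depend on the file type and the Tika version, so check what your
files produce before picking them.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper class, not part of Tika, Lucene or Solr.
class LuceneDocSketch {

    /**
     * Builds a Lucene document from text and metadata that Tika has already
     * extracted. Only the metadata keys listed in wantedKeys are copied over.
     */
    static Document toDocument(String bodyText, Metadata metadata,
                               String... wantedKeys) {
        Document doc = new Document();
        // Body text: analyzed for full-text search, not stored in the index.
        doc.add(new TextField("content", bodyText, Field.Store.NO));

        for (String key : wantedKeys) {
            String value = metadata.get(key); // first value, or null if absent
            if (value != null) {
                // Indexed as a single untokenized term and stored for retrieval.
                doc.add(new StringField(key, value, Field.Store.YES));
            }
        }
        return doc;
    }
}
```

StringField suits exact-match values such as a content type. For something
like a title that you want to search by individual words, a TextField is
probably the better fit.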
You could try Solr itself to see how it works:
http://wiki.apache.org/solr/ExtractingRequestHandler
-- Jack Krupansky
-----Original Message-----
From: Adrien Grand
Sent: Sunday, January 27, 2013 12:53 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for
indexing the actual content
Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?
This question may also be better suited to the Tika user mailing list [3].
[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html
--
Adrien
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org