Also, that chapter (7) has been rewritten in the revised Lucene in
Action (available through Manning's early access now); it's now based
entirely on Tika.
Note, though, that Tika only recently gained the ability to extract
text from Office 2007 documents (I think):
https://issues.apache.org/jira/browse/TIKA-152
so you'll have to build off of trunk or use the SNAPSHOT from Maven.
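
For a rough idea of what extraction from a Word 2007 file involves, here
is a minimal JDK-only sketch. It is not how Tika or POI do it, and I'd
still use one of them in practice. It assumes only that a .docx is a ZIP
archive whose body text sits in w:t elements inside word/document.xml.
DocxText, extract and sampleDocx are hypothetical names made up for this
example, and Excel/PowerPoint files, headers and footnotes are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Tika, POI or Lucene.
class DocxText {
    // Namespace of WordprocessingML elements such as w:p and w:t.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx stream, one line per paragraph. */
    static String extract(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // readAllBytes() (Java 9+) stops at the end of this entry.
                    return parse(zip.readAllBytes());
                }
            }
            throw new IllegalArgumentException("no word/document.xml entry");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    private static String parse(byte[] xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText; // true while inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes attrs) {
                if (W_NS.equals(uri) && local.equals("t")) inText = true;
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) return;
                if (local.equals("t")) inText = false;
                else if (local.equals("p")) out.append('\n'); // paragraph end
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new ByteArrayInputStream(xml), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse document.xml", ex);
        }
        return out.toString();
    }

    /**
     * Builds a tiny in-memory archive for trying out extract(). It is not a
     * complete Word file, and the paragraph text is not XML-escaped.
     */
    static byte[] sampleDocx(String... paragraphs) {
        StringBuilder xml = new StringBuilder(
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }
}
```

The string that extract() returns is what you would then put into a
Lucene field. Tika or POI give you the same kind of string for all three
Office 2007 formats.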
Mike
Otis Gospodnetic wrote:
Hi,
POI - http://poi.apache.org/
or
Tika (it uses POI) - http://lucene.apache.org/tika
And you can use code from Lucene in Action to index the text with
Lucene - http://manning.com/hatcher2 . The code is free to download.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: "Zhang, Lisheng" <lisheng.zh...@broadvision.com>
To: java-user@lucene.apache.org
Sent: Sunday, February 22, 2009 2:27:06 PM
Subject: Text extraction tool for Microsoft Office 2007
Hi,
What is the best tool (free software) to extract text from
Microsoft Office 2007:
Word 2007, Excel 2007, PowerPoint 2007
so that we can index them with Lucene?
Thanks very much for your help, Lisheng
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org