I've used POI, as well as commercial providers. As always, it
depends :-) I wasn't particularly impressed with the commercial
providers given the amount of money they wanted for it. PDF was
particularly tricky, but you weren't asking about that. At least w/
POI, you have the opportunity to fix things that don't work based on
your priorities. I don't know what the failure rate is for the
commercial providers, but my experience is they will all fail at least
once, so you better plan on it. I'd look to use a framework like Tika
or Aperture, where you can easily upgrade or plug in new or different
libraries (including commercial providers) as needed w/o rewriting
your code. Additionally, with something like Tika or Aperture, you
could easily mix and match your solutions, such that you use one for
Word and a different one for PPT or PDF.
One issue with any of them is how you plan to use them. If you need
more than bag of words, they all get less reliable, especially when it
comes to PDFs and Office docs. Dealing with things like tables,
columns, captions, labels, etc. has always been problematic in my
experience when one wants to do higher level processing (beyond
keyword search).
HTH,
Grant
On May 12, 2008, at 10:03 AM, Lukas Vlcek wrote:
Hi,
I need to find a reliable way how to extract content out of Word,
Excel and
PowerPoint formats prior to indexing and I am not sure if POI is the
best
way to go. Can anybody share experience with POI and/or other
[commercial]
Java library for text extraction from MS formats?
My experience with POI is such that sometimes it can be a pain to
get the
content out of the MS files properly. I also know that Nutch plugin
uses POI
for MS formats but as far as I remember it is not 100% reliable (my
more
then one year old experience is that about 1-2% of files were not
parsed).
My requirements are that the text extraction software must run on
Linux and
should be written in Java, it can be open source or commercial
library.
Regards,
Lukas
--
http://blog.lukas-vlcek.com/
--------------------------
Grant Ingersoll
http://lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]