Grant Ingersoll wrote:
I've used POI, as well as commercial providers. As always, it depends
:-) I wasn't particularly impressed with the commercial providers given
the amount of money they wanted for it. PDF was particularly tricky,
but you weren't asking about that. At least w/ POI, you have the
opportunity to fix things that don't work based on your priorities. I
don't know what the failure rate is for the commercial providers, but my
experience is they will all fail at least once, so you'd better plan on
it. I'd look to use a framework like Tika or Aperture, where you can
easily upgrade or plug in new or different libraries (including
commercial providers) as needed w/o rewriting your code. Additionally,
with something like Tika or Aperture, you could easily mix and match
your solutions, such that you use one for Word and a different one for
PPT or PDF.
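
To make the mix-and-match idea concrete, here is a minimal sketch, in Python for brevity. It is not the Tika or Aperture API. The registry class, the extractor functions, and the type keys ("doc", "pdf") are all hypothetical stand-ins. It only shows the shape: one extractor list per document type, tried in order, so a failing library can be backed up or swapped without touching the calling code.

```python
from typing import Callable, Dict, List

# Hypothetical extractor signature: raw bytes in, plain text out.
Extractor = Callable[[bytes], str]


class ExtractionError(Exception):
    """Raised when every extractor registered for a type has failed."""


class ExtractorRegistry:
    """Maps a document type to an ordered list of extractors."""

    def __init__(self) -> None:
        self._extractors: Dict[str, List[Extractor]] = {}

    def register(self, doc_type: str, extractor: Extractor) -> None:
        # Later registrations act as fallbacks for earlier ones.
        self._extractors.setdefault(doc_type, []).append(extractor)

    def extract(self, doc_type: str, data: bytes) -> str:
        errors = []
        for extractor in self._extractors.get(doc_type, []):
            try:
                return extractor(data)
            except Exception as exc:  # any library may fail; try the next one
                name = getattr(extractor, "__name__", repr(extractor))
                errors.append(f"{name}: {exc}")
        raise ExtractionError(f"no extractor succeeded for {doc_type!r}: {errors}")


# Stand-ins for real libraries (assumptions, for illustration only).
def flaky_word_extractor(data: bytes) -> str:
    raise ValueError("unsupported Word variant")


def backup_word_extractor(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


def pdf_extractor(data: bytes) -> str:
    return "pdf:" + data.decode("utf-8", errors="replace")


registry = ExtractorRegistry()
registry.register("doc", flaky_word_extractor)   # primary, fails here
registry.register("doc", backup_word_extractor)  # fallback
registry.register("pdf", pdf_extractor)          # different library for PDF
```

Calling registry.extract("doc", data) falls through to the backup extractor when the primary raises, and a type with no working extractor raises ExtractionError, so the caller can log the document and move on.
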
One issue with any of them is how you plan to use them. If you need
more than a bag of words, they all get less reliable, especially when it
comes to PDFs and Office docs. Dealing with things like tables,
columns, captions, labels, etc. has always been problematic in my
experience when one wants to do higher level processing (beyond keyword
search).

Yet another option: in the past I used a licensed copy of MS Office
to extract the things I wanted, using a bit of OLE automation and
VBScript. It worked reasonably well, in the sense that I had no issues
whatsoever with extracting the content _and_ formatting from any
document that could normally be opened with MS Office - however,
performance was an issue, i.e. it was slow, a CPU/memory hog, and
occasionally it would get stuck in a weird state where only a complete
reboot would help.
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com