Grant Ingersoll wrote:
I've used POI, as well as commercial providers. As always, it depends :-) I wasn't particularly impressed with the commercial providers given the amount of money they wanted. PDF was particularly tricky, but you weren't asking about that. At least w/ POI, you have the opportunity to fix things that don't work, based on your priorities. I don't know what the failure rate is for the commercial providers, but in my experience every library will fail on at least one document, so you'd better plan for it. I'd look at using a framework like Tika or Aperture, where you can easily upgrade or plug in new or different libraries (including commercial providers) as needed w/o rewriting your code. Additionally, with something like Tika or Aperture, you can easily mix and match your solutions, using one library for Word and a different one for PPT or PDF.
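
To make the "plug in and mix and match" idea concrete, here is a minimal sketch in Python. It is not the Tika or Aperture API; the names ExtractorRegistry, flaky_word_extractor and fallback_word_extractor are hypothetical and exist only for illustration. It assumes each library can be wrapped as a function that takes raw bytes and returns text, registers one or more such functions per document type, and falls back to the next one when an extractor fails.

```python
from typing import Callable, Dict, List

# Hypothetical extractor signature: raw document bytes in, plain text out.
Extractor = Callable[[bytes], str]


class ExtractorRegistry:
    """Maps a document type to an ordered list of interchangeable extractors."""

    def __init__(self) -> None:
        self._extractors: Dict[str, List[Extractor]] = {}

    def register(self, doc_type: str, extractor: Extractor) -> None:
        # Extractors registered first are tried first.
        self._extractors.setdefault(doc_type, []).append(extractor)

    def extract(self, doc_type: str, data: bytes) -> str:
        errors = []
        for extractor in self._extractors.get(doc_type, []):
            try:
                return extractor(data)
            except Exception as exc:
                # Any library can fail on some document: record it, try the next.
                name = getattr(extractor, "__name__", repr(extractor))
                errors.append(f"{name}: {exc}")
        raise RuntimeError(f"no extractor succeeded for {doc_type!r}: {errors}")


# Stand-ins for real libraries (illustrative only).
def flaky_word_extractor(data: bytes) -> str:
    raise ValueError("unsupported Word variant")


def fallback_word_extractor(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


registry = ExtractorRegistry()
registry.register("doc", flaky_word_extractor)     # preferred library
registry.register("doc", fallback_word_extractor)  # used if the first one fails
registry.register("pdf", lambda data: "pdf text")  # a different library for PDF
```

Calling code only ever talks to registry.extract(...), so swapping a library, or adding a commercial one for a single format, means changing one register call rather than the code that consumes the text.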

One issue with any of them is how you plan to use the output. If you need more than a bag of words, they all get less reliable, especially when it comes to PDFs and Office docs. Dealing with things like tables, columns, captions, labels, etc. has always been problematic in my experience when one wants to do higher-level processing (beyond keyword search).

Yet another option ... In the past I used a licensed copy of MS Office to extract the things I wanted, using a bit of OLE automation and VBScript. It worked reasonably well, in the sense that I had no issues whatsoever extracting the content _and_ formatting from any document that could normally be opened with MS Office. However, performance was an issue, i.e. it was slow and a CPU/memory hog, and occasionally it would get stuck in a weird state where only a complete reboot would help.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

