On Sat, 3 Jul 2010, Antoni Mylka wrote:
Did you try to compare the Word6Extractor against the one from text mining? How well does it extract text?

For my test documents, the POI code now does better than text mining. This is because we process not just the CHPX (character) properties but also the PAPX (paragraph) and SECX (section) tables, which means we know where the paragraphs are.

(However, we don't support decompressing the PAPX/CHPX properties, so you can't tell how a text run is formatted, only that it's formatted differently. If anyone cares about that, they'll need to figure out the differences in the style table between the old and the new format.)
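
To illustrate why knowing the paragraph ranges helps, here is a rough sketch of slicing a raw text stream by paragraph boundaries. It is not the POI code: ParagraphRangeDemo, CharRange and splitByRanges are made-up names for this example, and it assumes each table entry boils down to a half-open character range and that paragraphs end in a carriage return.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these names are hypothetical and are not POI classes.
final class ParagraphRangeDemo {

    /** A half-open character range [start, end) covering one paragraph. */
    static final class CharRange {
        final int start;
        final int end;

        CharRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Slices the raw text stream into paragraphs using the given ranges. */
    static List<String> splitByRanges(String text, List<CharRange> ranges) {
        List<String> paragraphs = new ArrayList<>();
        for (CharRange r : ranges) {
            // Clamp both ends so a damaged entry cannot run past the text.
            int start = Math.max(0, Math.min(r.start, text.length()));
            int end = Math.max(start, Math.min(r.end, text.length()));
            paragraphs.add(text.substring(start, end));
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Assumed sample data: three paragraphs, each ending in '\r'.
        String text = "Title\rFirst paragraph.\rSecond one.\r";
        List<CharRange> ranges = new ArrayList<>();
        ranges.add(new CharRange(0, 6));
        ranges.add(new CharRange(6, 23));
        ranges.add(new CharRange(23, 35));
        for (String paragraph : splitByRanges(text, ranges)) {
            System.out.println("[" + paragraph.trim() + "]");
        }
    }
}
```

Without the ranges, an extractor only has the flat text stream and has to guess where one paragraph stops and the next begins.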

BTW, http://code.google.com/p/text-mining/ contains examples of fastsaved files you could use in your tests. They probably can't be committed to ASF for legal reasons (can they?), but they make great tests nonetheless.

The text mining library is now LGPL, so we can't commit their test files to POI. If you fancy trying one of their sample fastsaved Word 6 or 95 files with Word6Extractor, I'd be interested to hear how it goes!

Nick
