On 2010-07-02 23:04, Nick Burch wrote:
Hi All
As you might've seen from my commits in the last few days, I've added
some initial support to HWPF for Word 6 and Word 95 files. I've only
been working with a view to doing text extraction (so I can ditch the
text mining library from a work project). With lots of trial and error,
some offset tips from WV's FIB parsing code, and some refactoring, we
can now get text and paragraphs out of Word 6 and Word 95 files!
To play with this, you'll want HWPFOldDocument / Word6Extractor (catch
OldWordFileFormatException from the regular classes and switch to the
old ones as needed).
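A minimal sketch of the catch-and-switch pattern described above. To keep
it runnable without POI on the classpath, OldFormatException and the two
suppliers are hypothetical stand-ins, not POI classes. In real use the
first supplier would presumably wrap the regular HWPF extractor, the
caught exception would be POI's OldWordFileFormatException, and the
second supplier would wrap Word6Extractor.

```java
import java.util.function.Supplier;

class ExtractorFallback {

    // Hypothetical stand-in for POI's OldWordFileFormatException, defined
    // locally so this sketch compiles without any external library.
    static class OldFormatException extends RuntimeException {
        OldFormatException(String message) {
            super(message);
        }
    }

    // Try the regular extractor first; if it rejects the file as an old
    // (Word 6 / Word 95) format, switch to the old-format extractor.
    // Any other exception is deliberately left to propagate.
    static String extractText(Supplier<String> regular, Supplier<String> oldFormat) {
        try {
            return regular.get();
        } catch (OldFormatException e) {
            return oldFormat.get();
        }
    }

    public static void main(String[] args) {
        // Simulate a Word 6 / Word 95 file: the regular path refuses it,
        // so the text comes from the fallback supplier.
        String text = extractText(
                () -> { throw new OldFormatException("Word 6 / Word 95 file"); },
                () -> "text from the old-format extractor");
        System.out.println(text);
    }
}
```

Only the old-format exception triggers the switch, so genuine I/O or
parsing failures are not silently masked by the fallback.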
I've got this working with various sample files produced by doing
save-as from newer software. This means that it's not impossible that
real Word 6 / Word 95 files will break it, especially if they're
quick-saved (I didn't have any examples).
As usual, please upload files that don't work to new bugzilla entries,
or even better upload the broken file and the patch that fixes it :)
A great idea.
Have you tried comparing the Word6Extractor against the one from text
mining? How well does it extract text?
BTW, http://code.google.com/p/text-mining/ contains examples of
fast-saved files you could use in your tests. They probably can't be
committed to ASF for legal reasons (can they?), but they make great
tests nonetheless.
Antoni Myłka