On Thu, 9 Feb 2006, Christiaan Fluit wrote:
My experience is that the WordDocument class crashes on about 25% of the documents, i.e. it throws some sort of Exception. I've tested POI 2.5.1-final as well as the current code in CVS, but both produce this result. I even suspect the output to be 100% the same, but I haven't verified this.

You could try using org.apache.poi.hwpf.HWPFDocument instead: get the range, then the paragraphs, and grab the text from each paragraph. If there's interest, I could probably commit an extractor that does this to POI.

(WordDocument is from the hdf package, which is older and less reliable than the current hwpf stuff)
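Roughly, the range/paragraph approach looks like the sketch below. It assumes an HWPF build that has the HWPFDocument(InputStream) constructor plus Range.numParagraphs(), Range.getParagraph(int) and Paragraph.text(); the class name HwpfTextSketch and the extractText helper are made up for illustration, and this is not the extractor mentioned above.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Hypothetical class name, used only for this sketch.
public class HwpfTextSketch {

    // Hypothetical helper: concatenates the text of every paragraph.
    static String extractText(InputStream in) throws IOException {
        HWPFDocument doc = new HWPFDocument(in);

        // The overall range covers the whole document.
        Range range = doc.getRange();

        StringBuilder text = new StringBuilder();
        for (int i = 0; i < range.numParagraphs(); i++) {
            Paragraph paragraph = range.getParagraph(i);
            // text() returns the raw paragraph text, which may still
            // contain the paragraph mark and other control characters.
            text.append(paragraph.text());
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

You would probably still want to catch exceptions per document, since HWPF can fail on some files too.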

Another reason I don't like this class is that it operates on an InputStream and internally creates a POIFSFileSystem which you cannot access, so that it becomes hard to extract document metadata as well (for which you need the POIFSFileSystem) without buffering the entire InputStream.

If you're using HWPFDocument from CVS, then you can create that from a POIFSFileSystem, so the stream only has to be read once.
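Something along these lines, as a sketch only. It assumes the HWPFDocument(POIFSFileSystem) constructor, and it assumes a POI version where the document class exposes getSummaryInformation(); on a build without that method you would have to read the summary stream from the same POIFSFileSystem through HPSF yourself. The class name HwpfMetadataSketch is made up.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical class name, used only for this sketch.
public class HwpfMetadataSketch {

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            // Read the stream once into a POIFSFileSystem that we keep hold of.
            POIFSFileSystem fs = new POIFSFileSystem(in);

            // Build the Word document from that same filesystem.
            HWPFDocument doc = new HWPFDocument(fs);

            // Assumption: getSummaryInformation() is available in this POI version.
            // It can return null when the document has no summary stream.
            SummaryInformation summary = doc.getSummaryInformation();
            if (summary != null) {
                System.out.println("Title:  " + summary.getTitle());
                System.out.println("Author: " + summary.getAuthor());
            }

            // The text comes from the same in-memory filesystem, no second read.
            System.out.println(doc.getRange().text());
        }
    }
}
```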

Nick
