Hi,
I am trying to extract plain text from Word documents (to index it into Lucene).
This is what I did:
    org.apache.poi.hwpf.extractor.WordExtractor we =
        new org.apache.poi.hwpf.extractor.WordExtractor(is);
    bodyText = we.getText();
I tested it on 48 documents. Most of them are quite simple (they don't contain
pictures or the like), but some are quite old (from around 2000).
I get an exception on 47 of the 48 documents. The stack trace is, for
instance:
    java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
        at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
        at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
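
To make the two numbers in the exception message easier to read, here is a minimal sketch that decodes them into bytes. It assumes the signature is the first 8 bytes of the file read as a little-endian long. The class and method names (SignatureDecoder, toHex, toAscii) are hypothetical helpers and not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI: turns the "read"/"expected" values
// from the exception message back into the bytes at the start of the file.
public class SignatureDecoder {

    // Assumption: the header signature is an 8-byte little-endian value.
    static byte[] littleEndianBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Space-separated hex dump of the 8 signature bytes, in file order.
    static String toHex(long signature) {
        StringBuilder sb = new StringBuilder();
        for (byte b : littleEndianBytes(signature)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // The same 8 bytes interpreted as ASCII text.
    static String toAscii(long signature) {
        return new String(littleEndianBytes(signature), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        long expected = -2226271756974174256L; // value POI expected
        long read = 7015536635646467195L;      // value POI found in the file
        System.out.println("expected: " + toHex(expected));
        System.out.println("read:     " + toHex(read) + "  \"" + toAscii(read) + "\"");
    }
}
```

Decoded this way, the expected value is the byte sequence D0 CF 11 E0 A1 B1 1A E1 and the value actually read is the ASCII text {\rtf1\a. That would suggest the failing files begin like RTF documents rather than binary Word files, although I have not confirmed this against the files themselves.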
Would love to get replies.