Hi

I am trying to extract plain text from Word documents (to index them in Lucene).
This is what I did:
    org.apache.poi.hwpf.extractor.WordExtractor we =
        new org.apache.poi.hwpf.extractor.WordExtractor(is);
    bodyText = we.getText();

I tested it on 48 documents, most of which are quite simple (they contain no
pictures or the like), though some of them are quite old (from around 2000).

I get an exception on 47 of the 48 documents. The stack trace is, for
instance:

java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
       at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
       at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
       at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)

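In case it helps with diagnosis: the two numbers in the exception message can be turned back into the bytes they came from. Below is a minimal sketch, assuming each number is the first eight bytes of a file read as a little-endian 64-bit integer. The class name SignatureDecoder and its methods are made up for illustration and are not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of POI): decodes the signature values
// printed in the "Invalid header signature" message.
class SignatureDecoder {

    // Assumption: the value is the first 8 bytes of the file, little-endian.
    static byte[] toBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Space-separated hex dump of the 8 bytes, in file order.
    static String toHex(long signature) {
        StringBuilder sb = new StringBuilder();
        for (byte b : toBytes(signature)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    // The same 8 bytes interpreted as ASCII text.
    static String toAscii(long signature) {
        return new String(toBytes(signature), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        long read = 7015536635646467195L;      // "read" value from the trace
        long expected = -2226271756974174256L; // "expected" value from the trace

        System.out.println("expected: " + toHex(expected));
        System.out.println("read:     " + toHex(read) + "  \"" + toAscii(read) + "\"");
    }
}
```

With the values from the trace above, the expected bytes come out as D0 CF 11 E0 A1 B1 1A E1, and the bytes actually read spell {\rtf1\a as ASCII. That would suggest the failing files are RTF rather than binary Word documents, though I have only looked at this one trace.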
Would love to get replies.
