Hi,
We have roughly 1,900 MS Word documents in a file repository that is used in a
DAM system. We simply need to extract text from the Word documents for
indexing purposes and figured we would give POI a try. We have tried the
stable 2.5.1 release as well as the alpha code, both with similar results:
a high percentage of failures.
Using WordDocument.writeAllText():
SUCCESS=1341 FAIL=585
Using WordExtractor.getText(), we get fewer than 10 failures.
Below are the 3 main exceptions we constantly get with WordDocument:
------------------------------------------------------------------------------------------------------
java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:189)
java.lang.NegativeArraySizeException
    at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:176)
    at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
    at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
    at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
    at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
    at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
    at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
java.lang.ArrayIndexOutOfBoundsException: 396
    at org.apache.poi.hdf.extractor.Utils.convertBytesToShort(Utils.java:47)
    at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:175)
    at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
    at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
    at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
    at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
    at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
    at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
    at org.openmrm.core.file.service.POIConverterService.executeConversion(POIConverterService.java:147)
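
For the first exception, the two numbers in the message can be decoded back into the bytes at the start of the file. The sketch below uses only the JDK (no POI calls) and assumes the signature is read as an 8-byte little-endian long; the class and method names are made up for illustration. Under that assumption, the "read" value decodes to the text `{\rtf1\a`, which suggests that at least some of the failing files may be RTF saved with a .doc extension rather than binary Word files. We have not confirmed this against the actual files.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI: decodes the signature values
// printed in the "Invalid header signature" message.
class HeaderSignatureCheck {
    // The "expected" value from the exception message above.
    static final long OLE2_SIGNATURE = -2226271756974174256L;

    // Assumption: the signature is an 8-byte little-endian long,
    // so this returns the bytes in the order they appear in the file.
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Renders the signature bytes as space-separated hex pairs.
    static String toHex(long signature) {
        StringBuilder sb = new StringBuilder();
        for (byte b : toFileBytes(signature)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Renders the signature bytes as ASCII text.
    static String toAscii(long signature) {
        return new String(toFileBytes(signature), StandardCharsets.US_ASCII);
    }

    // True if the first 8 bytes of a file match the expected signature.
    static boolean isOle2(byte[] firstBytes) {
        if (firstBytes.length < 8) {
            return false;
        }
        long value = ByteBuffer.wrap(firstBytes, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
        return value == OLE2_SIGNATURE;
    }

    public static void main(String[] args) {
        // The "read" value from the exception message above.
        long read = 7015536635646467195L;
        System.out.println(toHex(OLE2_SIGNATURE)); // prints D0 CF 11 E0 A1 B1 1A E1
        System.out.println(toAscii(read));         // prints {\rtf1\a
    }
}
```

A check like `isOle2` on the first 8 bytes of each file could be used to sort the repository before extraction, so that files that are not in the binary Word format are counted separately from genuine parser failures.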