high failure rate in WordDocument.writeAllText() extraction?

ahnf Mon, 12 Feb 2007 12:17:05 -0800

Hi,
We have roughly 1900 MS Word documents in a file repository that is used in a 
DAM system. We simply need to extract text from the Word documents for 
indexing purposes and figured we would give POI a try. We have tried the 
stable 2.5.1 release as well as the alpha code, both with similar results: a 
high percentage of failures.


Using WordDocument.writeAllText() 

SUCCESS= 1341 FAIL=585
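
For anyone who wants to reproduce this kind of tally, here is a minimal sketch of a harness that counts successes and failures and groups the failures by exception class. It is only an illustration: the `ExtractionTally` class and the `Extractor` interface are made-up names standing in for whatever does the real extraction, and no POI calls appear in it.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ExtractionTally {
    /** Hypothetical stand-in for the real extraction call (not a POI type). */
    public interface Extractor {
        String extract(String path) throws Exception;
    }

    public int success;
    public int fail;
    /** Failure count per exception class name, sorted by name. */
    public final Map<String, Integer> failuresByType = new TreeMap<>();

    /** Runs the extractor over every path and records the outcome. */
    public void run(List<String> paths, Extractor extractor) {
        for (String path : paths) {
            try {
                extractor.extract(path);
                success++;
            } catch (Exception e) {
                // Checked and runtime exceptions are both counted as failures.
                fail++;
                failuresByType.merge(e.getClass().getName(), 1, Integer::sum);
            }
        }
    }
}
```

Grouping by exception class makes it easy to see whether the failures come from a few distinct causes or from many unrelated ones.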

Using WordExtractor.getText() we get fewer than 10 failures.

Here are the 3 main exceptions we constantly get with WordDocument:
------------------------------------------------------------------------------------------------------

java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:91)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:189)
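
The two numbers in the "Invalid header signature" message are the first 8 bytes of the file read as a little-endian long. A minimal sketch, in plain Java with no POI calls, of turning them back into bytes (the `SignatureDecoder` class and its method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SignatureDecoder {
    /** Returns the 8 bytes of a little-endian long, in the order they appear in the file. */
    public static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    /** Interprets those 8 bytes as ASCII text. */
    public static String toAscii(long signature) {
        return new String(toFileBytes(signature), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The "read" value from the exception message.
        System.out.println(toAscii(7015536635646467195L));       // prints {\rtf1\a
        // The "expected" value, shown as hex for comparison with the OLE2 magic bytes.
        System.out.println(Long.toHexString(-2226271756974174256L)); // prints e11ab1a1e011cfd0
    }
}
```

The "read" value decodes to `{\rtf1\a`, which is how an RTF file begins. That suggests the files hitting this exception are RTF documents saved with a .doc extension rather than OLE2 Word files. The "expected" value is the OLE2 signature D0 CF 11 E0 A1 B1 1A E1 in little-endian order.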


java.lang.NegativeArraySizeException
    at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:176)
    at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
    at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
    at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
    at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
    at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
    at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)



java.lang.ArrayIndexOutOfBoundsException: 396
    at org.apache.poi.hdf.extractor.Utils.convertBytesToShort(Utils.java:47)
    at org.apache.poi.hdf.extractor.data.ListTables.createLVL(ListTables.java:175)
    at org.apache.poi.hdf.extractor.data.ListTables.initLFO(ListTables.java:148)
    at org.apache.poi.hdf.extractor.data.ListTables.<init>(ListTables.java:42)
    at org.apache.poi.hdf.extractor.WordDocument.createListTables(WordDocument.java:1639)
    at org.apache.poi.hdf.extractor.WordDocument.findFormatting(WordDocument.java:364)
    at org.apache.poi.hdf.extractor.WordDocument.processComplexFile(WordDocument.java:291)
    at org.apache.poi.hdf.extractor.WordDocument.readFIB(WordDocument.java:243)
    at org.apache.poi.hdf.extractor.WordDocument.<init>(WordDocument.java:193)
    at org.openmrm.core.file.service.POIConverterService.executeConversion(POIConverterService.java:147)









 