Seva Alekseyev created TIKA-2199:
------------------------------------

             Summary: RecordFormatException on a valid Excel file
                 Key: TIKA-2199
                 URL: https://issues.apache.org/jira/browse/TIKA-2199
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.14
         Environment: Windows 7 x64, JVM 1.8.0_101
            Reporter: Seva Alekseyev
         Attachments: CDC survcost.xls

The attached file, which opens in Excel, causes an error in the Tika parser:

org.apache.poi.util.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:98
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:177
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IllegalArgumentException: Start index must be less than end index.
    at org.apache.poi.hssf.usermodel.HSSFRichTextString.applyFont:136
    at org.apache.poi.hssf.record.TextObjectRecord.processFontRuns:155
    at org.apache.poi.hssf.record.TextObjectRecord.<init>:131
    at sun.reflect.GeneratedConstructorAccessor19.newInstance:-1
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
    at java.lang.reflect.Constructor.newInstance:-1
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create:84
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord:345
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord:307
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord:273
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:175
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:177
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
    at gov.nih.niaid.fscanner.Extract.ExtractContents:69



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
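
The "Caused by" section shows the failure originating in an index check reached while font runs of a text object record were being applied. The following is a minimal, standard-library-only sketch of how such a check could reject a record whose font runs are not in ascending order. It is an assumption for illustration, not POI's actual code: the class FontRunSketch, both method signatures, and the sample run offsets are hypothetical, and whether the attached file really contains out-of-order runs has not been established here.

```java
/** Simplified, hypothetical model of applying font runs to text; NOT POI's real implementation. */
public class FontRunSketch {

    /** Models a range guard that raises the message seen in the trace. */
    static void applyFont(int startIndex, int endIndex, int textLength) {
        if (startIndex > endIndex) {
            throw new IllegalArgumentException("Start index must be less than end index.");
        }
        if (startIndex < 0 || endIndex > textLength) {
            throw new IllegalArgumentException("Start and end index not in range.");
        }
        // A real implementation would record the font for [startIndex, endIndex) here.
    }

    /** Applies each run from its own start up to the next run's start (or the end of the text). */
    static void processFontRuns(int[] runStarts, int textLength) {
        for (int i = 0; i < runStarts.length; i++) {
            int end = (i + 1 < runStarts.length) ? runStarts[i + 1] : textLength;
            applyFont(runStarts[i], end, textLength);
        }
    }

    public static void main(String[] args) {
        // Ascending run starts are accepted without error.
        processFontRuns(new int[] {0, 5, 9}, 12);

        // Hypothetical malformed input: the second run starts after the third one.
        try {
            processFontRuns(new int[] {0, 9, 5}, 12);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

If the cause is along these lines, a spreadsheet application might tolerate such a record while a strict reader rejects it, which would be consistent with the file opening in Excel but failing in the parser.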