[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504563#comment-13504563 ]
Michael McCandless commented on TIKA-1033:
------------------------------------------

Here's the full stack trace when I parse the .xls file that TikaCLI extracts:

{noformat}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4eaf6cb1
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:292)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (2) bytes
    at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
    at org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
    at org.apache.poi.hssf.record.WindowOneRecord.<init>(WindowOneRecord.java:71)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
    ... 15 more
{noformat}

> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>
>                 Key: TIKA-1033
>                 URL: https://issues.apache.org/jira/browse/TIKA-1033
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: emb.ppt
>
>
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
> we seem to detect it as Excel (application/vnd.ms-excel) but then the
> ExcelExtractor hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
>     at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
>     at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, when you
> run TikaCLI you won't see any exception nor text extracted, but if you
> run with -z, it will save 1.xls which if you then try to parse with
> TikaCLI hits the above exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira