[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504563#comment-13504563
 ] 

Michael McCandless commented on TIKA-1033:
------------------------------------------

Here's the full stack trace when I parse the .xls file that TikaCLI extracts:
{noformat}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected 
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4eaf6cb1
        at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Unable to 
construct record instance
        at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
        at 
org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
        at 
org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
        at 
org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
        at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
        at 
org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
        at 
org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:292)
        at 
org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:144)
        at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
        at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
        at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 5 more
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data 
(0) to read requested (2) bytes
        at 
org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
        at 
org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
        at 
org.apache.poi.hssf.record.WindowOneRecord.<init>(WindowOneRecord.java:71)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at 
org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
        ... 15 more
{noformat}
                
> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>
>                 Key: TIKA-1033
>                 URL: https://issues.apache.org/jira/browse/TIKA-1033
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: emb.ppt
>
>
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
> we seem to detect it as Excel (application/vnd.ms-excel) but then the
> ExcelExtractor hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record 
> instance
>       at 
> org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
>       at 
> org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
>       at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
>       at 
> org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
>       at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
>       at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>       at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
>       at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, when you
> run TikaCLI you won't see any exception nor text extracted, but if you
> run with -z, it will save 1.xls which if you then try to parse with
> TikaCLI hits the above exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to