[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504588#comment-13504588
 ] 

Nick Burch commented on TIKA-1033:
----------------------------------

The "raw chart object" actually looks to be an Excel file. Running 
org.apache.poi.poifs.dev.POIFSLister against it gives:

  Root Entry -
    CompObj <(0x01)CompObj>
    Workbook
    Ole <(0x01)Ole>

So there's an Excel workbook in there. POIFSViewer shows that the only entry 
with any real data in it is the Workbook entry, and bits of text from the 
chart are there, so whatever the chart data is, it's in the Excel part of the 
file. That's why Tika is saying it's an Excel file!

Note that embedded objects in office files are actually stored as both the 
raw object (used for editing) and a rendered version of the file (normally an 
EMF, so that viewing the parent document is quick).
                
> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>
>                 Key: TIKA-1033
>                 URL: https://issues.apache.org/jira/browse/TIKA-1033
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: emb.ppt
>
>
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
> we seem to detect it as Excel (application/vnd.ms-excel) but then the
> ExcelExtractor hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
>       at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
>       at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
>       at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
>       at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
>       at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
>       at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>       at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
>       at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, when you
> run TikaCLI you won't see any exception or any extracted text. If you
> run with -z, though, it will save 1.xls, and parsing that file with
> TikaCLI hits the above exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
