[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226487#comment-15226487 ]
Hudson commented on TIKA-1033:
------------------------------
SUCCESS: Integrated in tika-2.x #75 (See
[https://builds.apache.org/job/tika-2.x/75/])
TIKA-1033 -- add identification for embedded MSChart.Graph files. (tallison:
rev 862234289514dede8362c04f64305a47b0580ec8)
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/POIContainerExtractionTest.java
* tika-core/src/test/java/org/apache/tika/TikaTest.java
* tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.xls
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
* tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.xlsx
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/AbstractPOIContainerExtractionTest.java
* tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.ppt
* tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.pptx
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java
* CHANGES.txt
> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>
> Key: TIKA-1033
> URL: https://issues.apache.org/jira/browse/TIKA-1033
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: emb.ppt, testMSChart-govdocs-428996.pptx
>
>
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8, yet
> we seem to detect it as Excel (application/vnd.ms-excel), and the
> ExcelExtractor then hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
>     at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
>     at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, running
> TikaCLI shows neither an exception nor any extracted text. If you
> run with -z, it saves the embedded file as 1.xls, and parsing that
> file with TikaCLI hits the above exception.
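
The mis-identification described above comes down to the embedded object's progID not being taken into account when its type is chosen. As a rough illustration only, and not the code from the commit, the following sketch maps a progID prefix to a media type string. The class name, the method name, and the specific media type strings returned are assumptions made for this example; the actual change lives in POIFSContainerDetector and OfficeParser and may work differently.

```java
import java.util.Locale;

public class ProgIdSketch {

    // Hypothetical helper: picks a media type string from an OLE progID.
    // The returned strings are illustrative assumptions, not values
    // confirmed by the issue or the commit.
    static String mediaTypeForProgId(String progId) {
        if (progId == null) {
            // No progID available: fall back to a generic OLE2 type.
            return "application/x-tika-msoffice";
        }
        String id = progId.toLowerCase(Locale.ROOT);
        if (id.startsWith("msgraph.chart")) {
            // Chart/Graph objects should not be handed to the Excel parser.
            return "application/vnd.ms-graph";
        }
        if (id.startsWith("excel.sheet")) {
            return "application/vnd.ms-excel";
        }
        return "application/x-tika-msoffice";
    }

    public static void main(String[] args) {
        // The progID reported in the issue.
        System.out.println(mediaTypeForProgId("MSGraph.Chart.8"));
    }
}
```

Checking the progID before falling back to content-based detection keeps an MSGraph.Chart.8 object away from ExcelExtractor, which is where the RecordFormatException above was raised.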
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)