[ 
https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097261#comment-13097261
 ] 

Michael McCandless commented on TIKA-705:
-----------------------------------------

Thanks for looking at this Nick!

So, is this something I somehow screwed up using PowerPoint 2007?  Or is 
PowerPoint 2007 simply producing an invalid OOXML file?

Is there anything we (or POI) can do here?  It's bad if users can produce 
files "normally" (i.e., just using PowerPoint) which Tika then chokes on...
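
To help narrow down which part name trips the [M1.6] check in the quoted trace, here is a minimal sketch of a pchar test based on the RFC 3986 grammar (unreserved / pct-encoded / sub-delims / ":" / "@"). It is not POI's implementation: the class and method names are hypothetical, the part names in main are made up for illustration, and other OPC part-name rules (empty segments, trailing dots, and so on) are deliberately ignored.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI or Tika.
class PartNameCheck {
    // One or more RFC 3986 pchar tokens: a literal allowed character,
    // or a percent sign followed by exactly two hex digits.
    private static final Pattern SEGMENT = Pattern.compile(
        "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})+");

    // True if the whole segment consists only of pchar tokens.
    static boolean isValidSegment(String segment) {
        return SEGMENT.matcher(segment).matches();
    }

    // Returns the first segment that contains a non-pchar character,
    // or null if every non-empty segment passes.
    static String firstInvalidSegment(String partName) {
        for (String segment : partName.split("/")) {
            if (!segment.isEmpty() && !isValidSegment(segment)) {
                return segment;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up part names, only to exercise the check.
        String[] names = {
            "/ppt/slides/slide1.xml",
            "/ppt/embeddings/bad name.bin"
        };
        for (String name : names) {
            System.out.println(name + " -> " + firstInvalidSegment(name));
        }
    }
}
```

Running something like this over the entry names in the .pptx (it is a zip file) would at least show whether a raw space or other unescaped character in a part name is the culprit.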

> Valid OOXML PPT file hits InvalidFormatException thrown in POI
> --------------------------------------------------------------
>
>                 Key: TIKA-705
>                 URL: https://issues.apache.org/jira/browse/TIKA-705
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: testPPT_various.pptx
>
>
> I took the "testRTFVarious.rtf" test case from TIKA-683, and saved it as 
> various other doc types, to generate more test cases.
> But when I did this for PPTX, the resulting file hits this exception:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Broken OOXML file
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:141)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:95)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:363)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: A segment shall not hold any characters other than pchar characters. [M1.6]
>       at org.apache.poi.openxml4j.opc.PackagePartName.checkPCharCompliance(PackagePartName.java:370)
>       at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfPartNameHaveInvalidSegments(PackagePartName.java:270)
>       at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:185)
>       at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:83)
>       at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:490)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:124)
>       ... 9 more
> {noformat}
> All I did was open Office 2007, copy/paste over the text from the Word doc, 
> and save it.  I.e., it should be a valid OOXML file, unless Office 2007 is 
> buggy?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
