[ 
https://issues.apache.org/jira/browse/TIKA-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301278#comment-17301278
 ] 

Nick Burch commented on TIKA-3316:
----------------------------------

Mimetype-wise, my view, for what it's worth...

It depends on how the OpenXPS format is related to the main XPS one.

If it's basically the same, with just a few tweaks, I'd say add a 
`; format=openxps` clarifier to it, in common with a few other formats. That 
lets those who are interested know we spotted the exact type, with no major 
change for most users.

If it needs moderate tweaks, I'd suggest adding a new subtype of the current 
XPS type for this.

If the two XPS variants are largely different, such that you need different 
parsers, I'd suggest adding a new `x-tika-xps` or similar type, making that the 
parent of both the current and new types, and possibly attaching the XPS file 
extension to the parent. That helps signpost that they're not the same even if 
they seem it!
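
To make the difference between the three options concrete, here is a rough, self-contained sketch. It is not Tika's API: the class `XpsTypeSketch` and its methods `addSuperType`, `baseType` and `isA` are hypothetical, and the type strings `application/vnd.ms-xpsdocument`, `application/oxps` and `application/x-tika-xps` are assumed purely for illustration. It only models the idea of "parameter on the same type" versus "child type" versus "shared parent type".

```java
import java.util.HashMap;
import java.util.Map;

public class XpsTypeSketch {
    // Illustrative type names only (assumptions, not confirmed by this thread)
    static final String XPS = "application/vnd.ms-xpsdocument";
    static final String OPENXPS = "application/oxps";
    static final String PARENT = "application/x-tika-xps";

    // child base type -> parent base type; a toy stand-in for a type registry
    private final Map<String, String> parents = new HashMap<>();

    public void addSuperType(String child, String parent) {
        parents.put(baseType(child), baseType(parent));
    }

    // Drop parameters such as "; format=openxps" to get the base type
    public static String baseType(String type) {
        int semi = type.indexOf(';');
        return (semi < 0 ? type : type.substring(0, semi)).trim();
    }

    // True if 'type' is 'ancestor' (ignoring parameters) or descends from it
    public boolean isA(String type, String ancestor) {
        String target = baseType(ancestor);
        String current = baseType(type);
        while (current != null) {
            if (current.equals(target)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        // Option 1: same type plus a parameter; anything keyed on XPS still matches
        XpsTypeSketch opt1 = new XpsTypeSketch();
        System.out.println("option 1: " + opt1.isA(XPS + "; format=openxps", XPS));

        // Option 2: OpenXPS as a subtype of the current XPS type
        XpsTypeSketch opt2 = new XpsTypeSketch();
        opt2.addSuperType(OPENXPS, XPS);
        System.out.println("option 2: " + opt2.isA(OPENXPS, XPS));

        // Option 3: both variants are siblings under a new shared parent
        XpsTypeSketch opt3 = new XpsTypeSketch();
        opt3.addSuperType(XPS, PARENT);
        opt3.addSuperType(OPENXPS, PARENT);
        System.out.println("option 3, OpenXPS is-a XPS: " + opt3.isA(OPENXPS, XPS));
        System.out.println("option 3, OpenXPS is-a parent: " + opt3.isA(OPENXPS, PARENT));
    }
}
```

Under option 3 the two variants no longer match each other, only the shared parent, which is what would let each one be routed to its own parser.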

> Illegal IOException processing XPS files
> ----------------------------------------
>
>                 Key: TIKA-3316
>                 URL: https://issues.apache.org/jira/browse/TIKA-3316
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.25
>            Reporter: Nick Harmer
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.26
>
>         Attachments: Screenshot from 2021-03-12 17-00-05.png, test1.xps, test2.xps, test3.xps, test4.xps
>
>
> I have a number of (relatively simple) XPS documents which Tika fails to 
> process.  The following exception appears:
> {code:java}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4149c063
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>         at com.mcms.Main.parseFile(Main.java:88)
>         at com.mcms.Main.main(Main.java:59)
> Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: Unsupported feature data descriptor used in entry Documents/1/Metadata/Page1_Thumbnail.JPG
>         at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:477)
>         at java.base/java.io.FilterInputStream.read(Unknown Source)
>         at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.read(ZipArchiveThresholdInputStream.java:80)
>         at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:182)
>         at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
>         at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:136)
>         at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
>         at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
>         at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
>         at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:111)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         ... 5 more
> {code}
>  
> Obviously the generator for these files (XPS printer driver from Notepad) 
> adds a per-page thumbnail image which Tika doesn't like.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)