[
https://issues.apache.org/jira/browse/TIKA-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301278#comment-17301278
]
Nick Burch commented on TIKA-3316:
----------------------------------
Mimetype-wise, my view, for what it's worth...

It depends on how closely the OpenXPS format is related to the main XPS one.

If it's basically the same, with just a few tweaks, I'd say add a
`; format=openxps` clarifier to the existing type, in common with a few other
formats. That lets those who are interested know we spotted the exact type,
with no major change for most users.

If it needs moderate tweaks, I'd suggest adding a new subtype of the current
XPS type for it.

If the two XPS variants are largely different, such that you need different
parsers, I'd suggest adding a new `x-tika-xps` or similar type, making that the
parent of both the current and new types, and possibly attaching the XPS file
extension to the parent. That helps signpost that they're not the same even if
they seem it!
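
To make the difference between the three options concrete, here is a rough
sketch in plain Java. It deliberately does not use Tika's own classes: the
`XpsTypeOptions` class, its `baseType` and `isA` helpers, and the parent map
are all hypothetical stand-ins, and the two concrete type strings are
assumptions (only `x-tika-xps` and `format=openxps` come from the suggestion
above). The point is only what a consumer asking "is this XPS?" would see
under each option.

```java
import java.util.HashMap;
import java.util.Map;

class XpsTypeOptions {
    // Assumed type names, for illustration only.
    static final String XPS = "application/vnd.ms-xpsdocument";
    static final String OPENXPS = "application/oxps";
    static final String PARENT = "application/x-tika-xps";

    /** Strips any "; key=value" parameters, leaving the base type. */
    static String baseType(String type) {
        int semi = type.indexOf(';');
        return (semi < 0 ? type : type.substring(0, semi)).trim();
    }

    /** Follows child -> parent links to check whether type is, or descends from, ancestor. */
    static boolean isA(Map<String, String> parents, String type, String ancestor) {
        for (String t = baseType(type); t != null; t = parents.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Option 1: same type plus a clarifying parameter. No hierarchy change.
        Map<String, String> noLinks = new HashMap<>();
        String tagged = XPS + "; format=openxps";
        System.out.println(isA(noLinks, tagged, XPS));       // true

        // Option 2: OpenXPS registered as a subtype of the current XPS type.
        Map<String, String> subtype = new HashMap<>();
        subtype.put(OPENXPS, XPS);
        System.out.println(isA(subtype, OPENXPS, XPS));      // true

        // Option 3: both variants are siblings under a new common parent.
        Map<String, String> siblings = new HashMap<>();
        siblings.put(XPS, PARENT);
        siblings.put(OPENXPS, PARENT);
        System.out.println(isA(siblings, OPENXPS, XPS));     // false
        System.out.println(isA(siblings, OPENXPS, PARENT));  // true
    }
}
```

Under the first two options anything that already handles the current XPS type
keeps matching OpenXPS files; under the third it stops matching, which is the
signposting effect described above.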
> Illegal IOException processing XPS files
> ----------------------------------------
>
> Key: TIKA-3316
> URL: https://issues.apache.org/jira/browse/TIKA-3316
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.25
> Reporter: Nick Harmer
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.26
>
> Attachments: Screenshot from 2021-03-12 17-00-05.png, test1.xps,
> test2.xps, test3.xps, test4.xps
>
>
> I have a number of (relatively simple) XPS documents which Tika fails to
> process. The following exception appears:
> {code:java}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4149c063
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>     at com.mcms.Main.parseFile(Main.java:88)
>     at com.mcms.Main.main(Main.java:59)
> Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: Unsupported feature data descriptor used in entry Documents/1/Metadata/Page1_Thumbnail.JPG
>     at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:477)
>     at java.base/java.io.FilterInputStream.read(Unknown Source)
>     at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.read(ZipArchiveThresholdInputStream.java:80)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:182)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:136)
>     at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
>     at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
>     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:111)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 5 more
> {code}
>
> Obviously the generator for these files (XPS printer driver from Notepad)
> adds a per-page thumbnail image which Tika doesn't like.
>
>