[
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830551#comment-17830551
]
Tim Allison commented on TIKA-4221:
-----------------------------------
This is caused by a change to unpack200's Archive class. In
commons-compress 1.25.0, the input stream was wrapped in a
CloseShieldInputStream and then not closed. Starting in 1.26.0, there's code
that unwraps FilterInputStreams to get down to the source stream. Because
CloseShieldInputStream is itself a FilterInputStream, the unwrapping defeats
the shield, and the underlying stream is closed.
See:
https://github.com/apache/commons-compress/blob/68cd2e7fb488b4ad8a9fdc81cae97ae6e8248ea5/src/main/java/org/apache/commons/compress/harmony/unpack200/Pack200UnpackerAdapter.java#L66
This only causes problems when an unpack200 file is embedded in another file
that is read via an ArchiveInputStream, which is why it shows up so rarely in
our corpus.
That said, this is less than ideal.
We can probably work around this by writing our own CloseShieldInputStream that
doesn't extend FilterInputStream.
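
A minimal sketch of what such a workaround could look like. The class name
NonFilterCloseShieldInputStream is hypothetical, and this is not necessarily
the fix that ends up in Tika. The idea is that the wrapper extends InputStream
directly, so code that only unwraps FilterInputStreams has nothing to unwrap,
and close() never reaches the delegate. It assumes the caller that owns the
underlying stream closes it separately.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

/**
 * Hypothetical close shield that extends InputStream directly rather than
 * FilterInputStream, so FilterInputStream-based unwrapping cannot reach
 * the delegate.
 */
final class NonFilterCloseShieldInputStream extends InputStream {

    // Kept private on purpose: there is no protected "in" field to unwrap.
    private final InputStream delegate;

    NonFilterCloseShieldInputStream(InputStream delegate) {
        this.delegate = Objects.requireNonNull(delegate, "delegate");
    }

    @Override
    public int read() throws IOException {
        return delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return delegate.read(b, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        return delegate.skip(n);
    }

    @Override
    public int available() throws IOException {
        return delegate.available();
    }

    @Override
    public void mark(int readlimit) {
        delegate.mark(readlimit);
    }

    @Override
    public void reset() throws IOException {
        delegate.reset();
    }

    @Override
    public boolean markSupported() {
        return delegate.markSupported();
    }

    @Override
    public void close() {
        // Intentionally a no-op: the owner of the delegate closes it.
    }
}
```

With a wrapper like this, the stream handed to the unpack200 code can be closed
by that code without affecting the enclosing ArchiveInputStream. Unlike the
commons-io class, this sketch keeps reading from the delegate after close()
rather than reporting end of stream.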
> Regression in unpack200 parsing in commons-compress
> ---------------------------------------------------
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> We noticed ~10 xz files with fewer attachments in the recent regression tests
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem,
> but not a blocker (IMHO).
> The stacktrace from
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
> looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception :
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.DefaultParser@56a4479a
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
> at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
> at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
> at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
> at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
> at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
> at org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
> at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
> at org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
> at java.io.FilterInputStream.available(FilterInputStream.java:168)
> at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
> at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
> at java.io.FilterInputStream.available(FilterInputStream.java:168)
> at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
> at java.io.FilterInputStream.available(FilterInputStream.java:168)
> at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
> at org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
> at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ... 85 more