[
https://issues.apache.org/jira/browse/TIKA-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830551#comment-17830551
]
Tim Allison edited comment on TIKA-4221 at 3/25/24 5:09 PM:
------------------------------------------------------------
This is caused by a change to pack200's Archive class. In
commons-compress 1.25.0, the input stream was wrapped in a
CloseShieldInputStream and then not closed. Starting in 1.26.0, there's code
that unwraps FilterInputStreams to get down to the source stream. Because
CloseShieldInputStream extends FilterInputStream, it is unwrapped too, so the
shield is defeated and the underlying stream is closed.
See:
https://github.com/apache/commons-compress/blob/68cd2e7fb488b4ad8a9fdc81cae97ae6e8248ea5/src/main/java/org/apache/commons/compress/harmony/unpack200/Pack200UnpackerAdapter.java#L66
This only causes problems when a pack200 file is embedded in another file that
is read with an ArchiveInputStream, which is why it is happening so rarely in
our corpus.
That said, this is less than ideal.
We can probably work around this by writing our own CloseShieldInputStream that
doesn't extend FilterInputStream.
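
A minimal sketch of what such a shield might look like, assuming the goal is
only to keep close() from propagating while staying out of the
FilterInputStream hierarchy. The class name NonFilterCloseShieldInputStream is
hypothetical and not an existing Tika or commons-io class; this is an
illustration of the idea, not the fix that was committed.

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical close shield that extends InputStream directly instead of
 * FilterInputStream, so code that unwraps FilterInputStream instances has
 * nothing to unwrap here.
 */
final class NonFilterCloseShieldInputStream extends InputStream {

    private final InputStream delegate;
    private boolean closed;

    NonFilterCloseShieldInputStream(InputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read() throws IOException {
        // After close(), behave like an exhausted stream.
        return closed ? -1 : delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return closed ? -1 : delegate.read(b, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        return closed ? 0 : delegate.skip(n);
    }

    @Override
    public int available() throws IOException {
        return closed ? 0 : delegate.available();
    }

    @Override
    public void close() {
        // Deliberately do not close the delegate; only mark this wrapper closed.
        closed = true;
    }
}
```

The caller that owns the underlying stream would wrap it in this class before
handing it to the pack200 code, and would remain responsible for closing the
underlying stream itself.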
> Regression in pack200 parsing in commons-compress
> -------------------------------------------------
>
> Key: TIKA-4221
> URL: https://issues.apache.org/jira/browse/TIKA-4221
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> There's a regression in pack200 that leads to the InputStream being closed
> even if wrapped in a CloseShieldInputStream.
> The xz symptom described below was the original signal that something was
> wrong, but the real problem is in pack200, not xz.
> We noticed ~10 xz files with fewer attachments in the recent regression tests
> in prep for the 2.9.2 release. This is 10 out of ~4500. So, it's a problem,
> but not a blocker (IMHO).
> The stacktrace from
> {{https://corpora.tika.apache.org/base/docs/commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH}}
> looks like this:
> 3: X-TIKA:EXCEPTION:embedded_exception : org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.DefaultParser@56a4479a
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
> at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
> at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
> at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
> at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:229)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
> at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:446)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:436)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:424)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:418)
> at org.apache.tika.parser.AutoDetectParserTest.oneOff(AutoDetectParserTest.java:563)
> ...
> Caused by: org.tukaani.xz.XZIOException: Stream closed
> at org.tukaani.xz.SingleXZInputStream.available(Unknown Source)
> at org.apache.commons.compress.compressors.xz.XZCompressorInputStream.available(XZCompressorInputStream.java:115)
> at java.io.FilterInputStream.available(FilterInputStream.java:168)
> at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
> at java.io.BufferedInputStream.available(BufferedInputStream.java:410)
> at java.io.FilterInputStream.available(FilterInputStream.java:168)
> at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
> at java.io.FilterInputStream.available(FilterInputStream.java:168)
> at org.apache.commons.io.input.ProxyInputStream.available(ProxyInputStream.java:84)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:800)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:412)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:389)
> at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:49)
> at org.apache.tika.parser.pkg.PackageParser.parseEntries(PackageParser.java:389)
> at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:329)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ... 85 more