Rob Vesse created TIKA-4469:
-------------------------------
Summary: After upgrading to 3.2.2 most files are incorrectly
treated as archives by AutoDetectParser
Key: TIKA-4469
URL: https://issues.apache.org/jira/browse/TIKA-4469
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 3.2.2
Reporter: Rob Vesse
Attachments: test.pdf
We had an application that was working fine with 3.2.1. After Dependabot
suggested an upgrade to 3.2.2, the builds for that PR started failing. On
investigation, we found that with Tika 3.2.2 the {{AutoDetectParser}} appears
to treat every file as a potential archive and then fails because the file
is not actually one:
{noformat}
Caused by: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
    at org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(ArchiveStreamFactory.java:295)
    at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:122)
    at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:180)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:179)
{noformat}
The code is pretty straightforward (simplified to remove some application
implementation detail):
{noformat}
Metadata tikaMetadata = new Metadata();
tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
tikaMetadata.set("Content-Type", "application/pdf");
// Use a BodyContentHandler as we just want the textual output
BodyContentHandler handler = new BodyContentHandler(-1);
// Prepare a Tika parse context
ParseContext context = new ParseContext();
// Actually parse the document and then produce the output event
// NB - input here in real code is a ByteArrayInputStream as these documents
// are coming to our code via a Kafka topic
AutoDetectParser parser = new AutoDetectParser();
parser.parse(input, handler, tikaMetadata, context);
{noformat}
I have attached the example {{test.pdf}} to this ticket. Note that this bug
happens with all file types, including things like plain text.
The workaround for 3.2.2 seems to be to change how we set the Tika metadata:
instead of the plain {{Content-Type}} key, use
{{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE}}.
However, if the file isn't successfully detected as an archive, I would expect
Tika to fall back to trying other content detectors rather than bailing out
early. This was the behaviour prior to 3.2.2, and tests with this file and
other files were passing before the upgrade.
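As a rough, Tika-independent illustration of the fallback behaviour described above, the sketch below tries each detector in turn and treats an exception from one detector as "no opinion" rather than a fatal error. Everything in it ({{SketchDetector}}, {{FallbackDetection}}, {{detectWithFallback}}, the two toy detectors) is a hypothetical name invented for this example. It is not Tika's {{Detector}} API and not how {{CompositeDetector}} is actually implemented, and it assumes an input stream that supports mark/reset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in for a content detector; NOT org.apache.tika.detect.Detector.
interface SketchDetector {
    // Returns a media type string, or null if this detector cannot decide.
    String detect(InputStream in) throws IOException;
}

final class FallbackDetection {
    static final String OCTET_STREAM = "application/octet-stream";

    // Toy archive detector: recognises the ZIP "PK" signature and, like the
    // stack trace in this report, throws when the signature does not match.
    static final SketchDetector ARCHIVE = in -> {
        byte[] head = in.readNBytes(2);
        if (head.length == 2 && head[0] == 'P' && head[1] == 'K') {
            return "application/zip";
        }
        throw new IOException("No archiver found for the stream signature");
    };

    // Toy PDF detector: checks for the "%PDF-" magic bytes.
    static final SketchDetector PDF = in -> {
        byte[] head = in.readNBytes(5);
        String magic = new String(head, StandardCharsets.US_ASCII);
        return "%PDF-".equals(magic) ? "application/pdf" : null;
    };

    // Tries each detector in order. A failing detector is skipped instead of
    // aborting detection. Assumes the stream supports mark/reset.
    static String detectWithFallback(List<SketchDetector> detectors, InputStream in) {
        for (SketchDetector detector : detectors) {
            in.mark(1024);
            try {
                String type = detector.detect(in);
                if (type != null) {
                    return type;
                }
            } catch (IOException | RuntimeException e) {
                // Treated as "this detector has no opinion"; try the next one.
            } finally {
                try {
                    in.reset(); // rewind so the next detector sees the same bytes
                } catch (IOException ignored) {
                    // Nothing sensible to do in this sketch if the rewind fails.
                }
            }
        }
        return OCTET_STREAM;
    }
}
```

With this arrangement, calling {{detectWithFallback}} with the archive detector first and the PDF detector second on a {{ByteArrayInputStream}} holding PDF bytes skips the archive failure and lets the PDF detector decide, which is the kind of outcome the pre-3.2.2 behaviour gave us.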
I suspect this bug is most likely related to the fix for TIKA-4424.