Rob Vesse created TIKA-4469:
-------------------------------

             Summary: After upgrading to 3.2.2 most files are incorrectly treated as Archive's by AutoDetectParser
                 Key: TIKA-4469
                 URL: https://issues.apache.org/jira/browse/TIKA-4469
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 3.2.2
            Reporter: Rob Vesse
         Attachments: test.pdf

We had an application that was working fine with 3.2.1, but after Dependabot suggested an upgrade to 3.2.2 the builds for that PR started failing. On investigation, we found that with Tika 3.2.2 the {{AutoDetectParser}} seems to treat every file as potentially being an archive and then fails because it actually isn't:

{noformat}
Caused by: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
	at org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(ArchiveStreamFactory.java:295)
	at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:122)
	at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:180)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:179)
{noformat}

The code is pretty straightforward (simplified to take out some application implementation detail):

{noformat}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

Metadata tikaMetadata = new Metadata();
tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
tikaMetadata.set("Content-Type", "application/pdf");

// Use a BodyContentHandler as we just want the textual output
BodyContentHandler handler = new BodyContentHandler(-1);

// Prepare a Tika parse context
ParseContext context = new ParseContext();

// Actually parse the document and then produce the output event
// NB - input here in real code is a ByteArrayInputStream as these documents are coming to our code via a Kafka topic
AutoDetectParser parser = new AutoDetectParser();
parser.parse(input, handler, tikaMetadata, context);
{noformat}

I have attached the example {{test.pdf}} to this ticket. Note that this bug happens with all file types, including things like plain text.

The "fix" for using 3.2.2 seems to be to change how we set the Tika metadata, so that the content type is set via {{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE}} instead of {{Content-Type}}.

However, if the file isn't successfully detected as an archive, I would expect Tika to fall back to trying the other content detectors rather than bailing out early. That was the behaviour prior to 3.2.2, and tests with this file and with other files were working fine before the upgrade.

I suspect this bug is most likely related to the fix for TIKA-4424.
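
To make the expected fallback concrete, here is a minimal, self-contained sketch of the behaviour I have in mind. It deliberately does not use Tika's classes: {{SketchDetector}}, {{ARCHIVE}}, {{PDF}} and {{detect}} are hypothetical stand-ins invented for this illustration, the magic-byte checks are simplified, and this is not a claim about how {{CompositeDetector}} or {{DefaultZipContainerDetector}} are actually implemented. The only point is that a failure in the archive detector is treated as "no match" so the remaining detectors still get a chance.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins for illustration only; none of these are Tika types.
final class DetectorFallbackSketch {

    /** A detector either recognises the leading bytes, declines, or fails. */
    interface SketchDetector {
        Optional<String> detect(byte[] prefix) throws Exception;
    }

    /** Mimics an archive sniffer that throws when no signature matches. */
    static final SketchDetector ARCHIVE = prefix -> {
        // Simplified check: real ZIP signatures are longer than "PK".
        if (startsWith(prefix, "PK")) {
            return Optional.of("application/zip");
        }
        throw new Exception("No Archiver found for the stream signature");
    };

    /** Recognises a PDF header and declines politely otherwise. */
    static final SketchDetector PDF = prefix ->
            startsWith(prefix, "%PDF-") ? Optional.of("application/pdf") : Optional.empty();

    static boolean startsWith(byte[] data, String magic) {
        byte[] expected = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    /** Tries each detector in order; a throwing detector counts as "no match". */
    static String detect(List<SketchDetector> detectors, byte[] prefix) {
        for (SketchDetector detector : detectors) {
            try {
                Optional<String> type = detector.detect(prefix);
                if (type.isPresent()) {
                    return type.get();
                }
            } catch (Exception e) {
                // Swallow the failure and fall through to the next detector.
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfPrefix = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(List.of(ARCHIVE, PDF), pdfPrefix));
    }
}
```

With the archive stand-in listed first, a PDF prefix still reaches the PDF stand-in, and input that no detector recognises ends up as the generic {{application/octet-stream}} rather than surfacing an exception to the caller.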