[ https://issues.apache.org/jira/browse/TIKA-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015090#comment-18015090 ]
Tilman Hausherr commented on TIKA-4469:
---------------------------------------

It works for me, here's the actual code I used:
{code:java}
InputStream is = URI.create("https://issues.apache.org/jira/secure/attachment/13078051/test.pdf").toURL().openStream();
is = new ByteArrayInputStream(is.readAllBytes());
Metadata tikaMetadata = new Metadata();
tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
tikaMetadata.set("Content-Type", "application/pdf");
// Use a BodyContentHandler as we just want the textual output
BodyContentHandler handler = new BodyContentHandler(-1);
// Prepare a Tika parse context
ParseContext context = new ParseContext();
// Actually parse the document and then produce the output event
// NB - input here in real code is a ByteArrayInputStream as these documents are coming to our code via a Kafka topic
AutoDetectParser parser = new AutoDetectParser();
parser.parse(is, handler, tikaMetadata, context);
{code}

> After upgrading to 3.2.2 most files are incorrectly treated as Archive's by
> AutoDetectParser
> --------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4469
>                 URL: https://issues.apache.org/jira/browse/TIKA-4469
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 3.2.2
>            Reporter: Rob Vesse
>            Priority: Major
>         Attachments: test.pdf
>
>
> We had an application that was working fine with 3.2.1; after Dependabot
> suggested an upgrade to 3.2.2, the builds for that PR were failing.
> On investigation it was found that with Tika 3.2.2 the {{AutoDetectParser}}
> seems to treat every file as potentially being an archive file and then fails
> because it actually isn't:
> {noformat}
> Caused by: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
>     at org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(ArchiveStreamFactory.java:295)
>     at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:122)
>     at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:180)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:179)
> {noformat}
> Code is pretty straightforward (simplified to take out some application
> implementation detail):
> {noformat}
> Metadata tikaMetadata = new Metadata();
> tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
> tikaMetadata.set("Content-Type", "application/pdf");
> // Use a BodyContentHandler as we just want the textual output
> BodyContentHandler handler = new BodyContentHandler(-1);
> // Prepare a Tika parse context
> ParseContext context = new ParseContext();
> // Actually parse the document and then produce the output event
> // NB - input here in real code is a ByteArrayInputStream as these documents are coming to our code via a Kafka topic
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(input, handler, tikaMetadata, context);
> {noformat}
> I have attached the example {{test.pdf}} to this ticket. Note that this bug
> happens with all file types, including things like plain text.
> The "fix" for using 3.2.2 seems to be to change how we set the Tika metadata
> to instead use {{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE}}.
> However, if the file isn't successfully detected as an archive, I would expect
> Tika to fall back to trying other content detectors rather than bailing out
> early, as this was the behaviour prior to 3.2.2 and tests with this file, and
> other files, were working fine prior to 3.2.2.
> I suspect this bug is most likely related to the fix for TIKA-4424



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
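
The fallback behaviour the reporter expects can be illustrated with a minimal, library-free sketch. This is not Tika's implementation: {{SketchDetector}}, {{FallbackDetection}} and the two stand-in detectors are hypothetical names invented for illustration, and the "archive" stand-in simply throws to mimic the exception in the stack trace above. The idea is only that a detector which cannot identify the stream is treated as having no opinion, so later detectors still get a chance.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a content detector; NOT Tika's Detector interface. */
interface SketchDetector {
    /** Returns a media type if recognised, empty if not; may throw on unrecognised input. */
    Optional<String> detect(byte[] head) throws Exception;
}

final class FallbackDetection {
    static final String DEFAULT_TYPE = "application/octet-stream";

    /** Tries each detector in order; a failing detector is skipped instead of aborting. */
    static String detect(List<SketchDetector> detectors, byte[] head) {
        for (SketchDetector detector : detectors) {
            try {
                Optional<String> type = detector.detect(head);
                if (type.isPresent()) {
                    return type.get();
                }
            } catch (Exception e) {
                // Assumption for this sketch: "could not identify the stream"
                // means "no opinion", so fall through to the next detector.
            }
        }
        return DEFAULT_TYPE;
    }

    /** Stand-in archive detector that always fails, mimicking the reported exception. */
    static Optional<String> archive(byte[] head) throws Exception {
        throw new Exception("No Archiver found for the stream signature");
    }

    /** Stand-in PDF detector that checks for the "%PDF-" header bytes. */
    static Optional<String> pdf(byte[] head) {
        String prefix = new String(head, 0, Math.min(head.length, 5), StandardCharsets.US_ASCII);
        return prefix.equals("%PDF-") ? Optional.of("application/pdf") : Optional.empty();
    }

    /** Runs the archive stand-in first, then the PDF stand-in. */
    static String detectWithDefaults(byte[] head) {
        return detect(List.of(FallbackDetection::archive, FallbackDetection::pdf), head);
    }
}
```

With this ordering, input beginning with {{%PDF-}} is still reported as a PDF even though the first detector throws, and unrecognised input ends up with the generic default type rather than an exception.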