[ 
https://issues.apache.org/jira/browse/TIKA-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015126#comment-18015126
 ] 

Tilman Hausherr commented on TIKA-4469:
---------------------------------------

See also here: 
https://github.com/quarkiverse/quarkus-tika/pull/230

> After upgrading to 3.2.2 most files are incorrectly treated as Archive's by 
> AutoDetectParser
> --------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4469
>                 URL: https://issues.apache.org/jira/browse/TIKA-4469
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 3.2.2
>            Reporter: Rob Vesse
>            Priority: Major
>         Attachments: test.pdf
>
>
> We had an application that was working fine with 3.2.1. After Dependabot 
> suggested an upgrade to 3.2.2, the builds for that PR started failing. On 
> investigation, we found that with Tika 3.2.2 the {{AutoDetectParser}} 
> seems to treat every file as a potential archive and then fails 
> because the file actually isn't one:
> {noformat}
> Caused by: org.apache.commons.compress.archivers.ArchiveException: No 
> Archiver found for the stream signature
>       at 
> org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(ArchiveStreamFactory.java:295)
>       at 
> org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:122)
>       at 
> org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:180)
>       at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>       at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:179)
> {noformat}
> The code is pretty straightforward (simplified to remove some application 
> implementation detail):
> {noformat}
> Metadata tikaMetadata = new Metadata();
> tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
> tikaMetadata.set("Content-Type", "application/pdf");
> // Use a BodyContentHandler as we just want the textual output
> BodyContentHandler handler = new BodyContentHandler(-1);
> // Prepare a Tika parse context
> ParseContext context = new ParseContext();
> // Actually parse the document and then produce the output event
> // NB - input here in real code is a ByteArrayInputStream as these documents
> // are coming to our code via a Kafka topic
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(input, handler, tikaMetadata, context);
> {noformat}
> I have attached the example {{test.pdf}} to this ticket.  Note that this bug 
> happens with all file types, including things like plain text.
> The "fix" for using 3.2.2 seems to be to change how we set the Tika metadata 
> to instead use {{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE}}.
> However, if the file isn't successfully detected as an archive, I would expect 
> Tika to fall back to trying other content detectors rather than bailing out 
> early, as this was the behaviour prior to 3.2.2, and tests with this file, and 
> other files, were working fine prior to 3.2.2.
> I suspect this bug is most likely related to the fix for TIKA-4424.
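
The fallback behaviour the reporter expects can be illustrated with a small, self-contained sketch. This is not Tika's code: SimpleDetector, ARCHIVE, PDF and FallbackDetection are hypothetical names invented for the example, and the magic-byte checks are deliberately simplistic. It only shows the idea that a detector which throws on a stream it does not recognise is skipped, so the remaining detectors still get a chance.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

class FallbackDetection {

    /** Hypothetical stand-in for a content detector; NOT Tika's Detector interface. */
    interface SimpleDetector {
        Optional<String> detect(byte[] prefix) throws Exception;
    }

    /** Returns true if data begins with the ASCII bytes of magic. */
    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    /** Mimics the reported failure mode: throws when the stream is not an archive. */
    static final SimpleDetector ARCHIVE = prefix -> {
        if (startsWith(prefix, "PK")) {
            return Optional.of("application/zip");
        }
        throw new IllegalStateException("No Archiver found for the stream signature");
    };

    /** Recognises the PDF header and otherwise reports "no opinion". */
    static final SimpleDetector PDF = prefix -> {
        if (startsWith(prefix, "%PDF-")) {
            return Optional.of("application/pdf");
        }
        return Optional.empty();
    };

    /** Tries each detector in order; a detector that throws is skipped, not fatal. */
    static String detect(List<SimpleDetector> detectors, byte[] prefix) {
        for (SimpleDetector detector : detectors) {
            try {
                Optional<String> type = detector.detect(prefix);
                if (type.isPresent()) {
                    return type.get();
                }
            } catch (Exception e) {
                // Assumption for this sketch: a failing detector means "cannot tell",
                // so detection continues with the next detector.
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(List.of(ARCHIVE, PDF), pdf));
    }
}
```

With this arrangement the archive detector's exception on a PDF header does not abort detection; the PDF detector still runs, and unrecognised input falls through to a generic type.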



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
