[ 
https://issues.apache.org/jira/browse/TIKA-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015090#comment-18015090
 ] 

Tilman Hausherr commented on TIKA-4469:
---------------------------------------

It works for me, here's the actual code I used:
{code:java}
        InputStream is = URI.create("https://issues.apache.org/jira/secure/attachment/13078051/test.pdf").toURL().openStream();
        is = new ByteArrayInputStream(is.readAllBytes());

        Metadata tikaMetadata = new Metadata();
        tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
        tikaMetadata.set("Content-Type", "application/pdf");

        // Use a BodyContentHandler as we just want the textual output
        BodyContentHandler handler = new BodyContentHandler(-1);

        // Prepare a Tika parse context
        ParseContext context = new ParseContext();

        // Actually parse the document and then produce the output event
        // NB - input here in real code is a ByteArrayInputStream as these
        // documents are coming to our code via a Kafka topic
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(is, handler, tikaMetadata, context);
{code}

> After upgrading to 3.2.2 most files are incorrectly treated as Archive's by 
> AutoDetectParser
> --------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4469
>                 URL: https://issues.apache.org/jira/browse/TIKA-4469
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 3.2.2
>            Reporter: Rob Vesse
>            Priority: Major
>         Attachments: test.pdf
>
>
> We had an application that was working fine with 3.2.1. After Dependabot 
> suggested an upgrade to 3.2.2, the builds for that PR started failing.  On 
> investigation we found that with Tika 3.2.2 the {{AutoDetectParser}} 
> seems to treat every file as a potential archive and then fails 
> because the file is not actually an archive:
> {noformat}
> Caused by: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
>       at org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(ArchiveStreamFactory.java:295)
>       at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:122)
>       at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:180)
>       at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:179)
> {noformat}
> Code is pretty straightforward (simplified to take out some application 
> implementation detail):
> {noformat}
> Metadata tikaMetadata = new Metadata();
> tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
> tikaMetadata.set("Content-Type", "application/pdf");
> // Use a BodyContentHandler as we just want the textual output
> BodyContentHandler handler = new BodyContentHandler(-1);
> // Prepare a Tika parse context
> ParseContext context = new ParseContext();
> // Actually parse the document and then produce the output event
> // NB - input here in real code is a ByteArrayInputStream as these documents 
> // are coming to our code via a Kafka topic
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(input, handler, tikaMetadata, context);
> {noformat}
> I have attached the example {{test.pdf}} to this ticket.  Note that this bug 
> happens with all file types, including things like plain text.
> The "fix" for using 3.2.2 seems to be to change how we set the Tika metadata 
> to instead use {{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE}}.
> However, if the file isn't successfully detected as an archive, I would expect 
> Tika to fall back to trying other content detectors rather than bailing out 
> early. This was the behaviour prior to 3.2.2, and tests with this file and 
> other files were working fine before the upgrade.
> I suspect this bug is most likely related to the fix for TIKA-4424
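
For illustration only, the fallback behaviour the reporter expects could be sketched as below. This is a self-contained example, not Tika's {{CompositeDetector}} implementation: the {{TypeDetector}} interface, the {{detectWithFallback}} and {{pdfMagic}} helpers, and the default type returned at the end are all hypothetical names and choices made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

class FallbackDetection {

    /** Hypothetical detector interface; this is NOT Tika's Detector API. */
    interface TypeDetector {
        Optional<String> detect(byte[] data) throws Exception;
    }

    /** Hypothetical magic-byte check: reports a PDF if the data starts with "%PDF-". */
    static Optional<String> pdfMagic(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return Optional.empty();
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return Optional.empty();
            }
        }
        return Optional.of("application/pdf");
    }

    /**
     * Tries each detector in order. A detector that throws is treated as
     * having no opinion, so the remaining detectors still get a chance.
     */
    static String detectWithFallback(List<TypeDetector> detectors, byte[] data) {
        for (TypeDetector detector : detectors) {
            try {
                Optional<String> type = detector.detect(data);
                if (type.isPresent()) {
                    return type.get();
                }
            } catch (Exception e) {
                // Swallow the failure and move on to the next detector.
            }
        }
        // Assumed default when no detector recognises the data.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // Stand-in for a detector that fails on non-archive input.
        TypeDetector archive = data -> {
            throw new IllegalStateException("No Archiver found for the stream signature");
        };
        TypeDetector pdf = FallbackDetection::pdfMagic;

        byte[] input = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        // prints application/pdf
        System.out.println(detectWithFallback(List.of(archive, pdf), input));
    }
}
```

Whether a detector failure should be swallowed silently or logged is a design choice; the sketch swallows it only to keep the example short.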



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
