Rob Vesse created TIKA-4469:
-------------------------------

             Summary: After upgrading to 3.2.2 most files are incorrectly 
treated as archives by AutoDetectParser
                 Key: TIKA-4469
                 URL: https://issues.apache.org/jira/browse/TIKA-4469
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 3.2.2
            Reporter: Rob Vesse
         Attachments: test.pdf

We had an application that was working fine with 3.2.1.  After Dependabot 
suggested an upgrade to 3.2.2, the builds for that PR started failing.  On 
investigation we found that with Tika 3.2.2 the {{AutoDetectParser}} seems 
to treat every file as potentially being an archive and then fails when the 
file turns out not to be one:


{noformat}
Caused by: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
        at org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(ArchiveStreamFactory.java:295)
        at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:122)
        at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:180)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:179)
{noformat}

The code is pretty straightforward (simplified to remove some application 
implementation detail):

{noformat}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

Metadata tikaMetadata = new Metadata();
tikaMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.pdf");
tikaMetadata.set("Content-Type", "application/pdf");

// Use a BodyContentHandler as we just want the textual output
// (-1 disables the write limit)
BodyContentHandler handler = new BodyContentHandler(-1);

// Prepare a Tika parse context
ParseContext context = new ParseContext();

// Actually parse the document and then produce the output event
// NB - input here in real code is a ByteArrayInputStream as these documents
// are coming to our code via a Kafka topic
AutoDetectParser parser = new AutoDetectParser();
parser.parse(input, handler, tikaMetadata, context);
{noformat}

I have attached the example {{test.pdf}} to this ticket.  Note that this bug 
happens with all file types, including things like plain text.

The workaround when using 3.2.2 seems to be to change how we set the Tika 
metadata, so that the content type is set via 
{{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE}} instead of {{Content-Type}}.

However, if the file isn't successfully detected as an archive, I would expect 
Tika to fall back to trying other content detectors rather than bailing out 
early.  This was the behaviour prior to 3.2.2, and tests with this file, and 
other files, were passing before the upgrade.
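
The expected fallback can be sketched roughly as follows.  This is a 
hypothetical illustration only, not Tika's implementation: 
{{SimpleDetector}}, {{detectWithFallback}} and {{defaultDetectors}} are 
made-up names, and the magic-byte checks are deliberately simplistic.  The 
point is that a detector failing to recognise a stream is treated as "no 
match" rather than as a fatal error.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

class FallbackDetectionSketch {

    /** Hypothetical stand-in for a detector; NOT Tika's Detector interface. */
    interface SimpleDetector {
        Optional<String> detect(byte[] prefix) throws IOException;
    }

    /** Returns true if data begins with the ASCII bytes of magic. */
    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    /** Tries each detector in order; a failing detector is skipped, not fatal. */
    static String detectWithFallback(List<SimpleDetector> detectors, byte[] prefix) {
        for (SimpleDetector detector : detectors) {
            try {
                Optional<String> type = detector.detect(prefix);
                if (type.isPresent()) {
                    return type.get();
                }
            } catch (IOException e) {
                // Assumption: an unrecognised signature means "not my format",
                // so move on to the next detector instead of aborting.
            }
        }
        return "application/octet-stream";
    }

    /** An archive detector that throws on unknown input, followed by a PDF detector. */
    static List<SimpleDetector> defaultDetectors() {
        SimpleDetector archive = prefix -> {
            if (startsWith(prefix, "PK")) {
                return Optional.of("application/zip");
            }
            throw new IOException("No Archiver found for the stream signature");
        };
        SimpleDetector pdf = prefix ->
                startsWith(prefix, "%PDF-") ? Optional.of("application/pdf") : Optional.empty();
        return List.of(archive, pdf);
    }

    public static void main(String[] args) {
        byte[] pdfBytes = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        byte[] textBytes = "hello world".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectWithFallback(defaultDetectors(), pdfBytes));
        System.out.println(detectWithFallback(defaultDetectors(), textBytes));
    }
}
```

In this sketch the PDF bytes make the archive detector throw, yet detection 
still continues to the PDF detector; unrecognised input falls through to the 
generic {{application/octet-stream}} default.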

I suspect this bug is most likely related to the fix for TIKA-4424.


