Simon Gaeremynck created TIKA-3556:
--------------------------------------
Summary: DefaultZipContainerDetector returns application/zip for
.odt files when OPCPackageDetector is on the classpath
Key: TIKA-3556
URL: https://issues.apache.org/jira/browse/TIKA-3556
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 2.1.0
Reporter: Simon Gaeremynck
This is happening because the OPCPackageDetector.detect method will [fail and
close the underlying zip
stream|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java#L257].
When the next detector runs (e.g. OpenDocumentDetector), the stream it
receives is already closed, so it cannot detect anything.
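
The failure mode described above can be reproduced in isolation. The following is only a minimal sketch using java.util.zip, not Tika's code: the class SharedZipDemo, its methods, the closeOnFailure flag and the "application/x-demo-ooxml" placeholder type are all hypothetical names made up for illustration. It assumes two detectors share one open ZipFile, and that the first one closes it when its check fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for a chain of zip detectors; none of these names are Tika's.
final class SharedZipDemo {
    static final String ODT = "application/vnd.oasis.opendocument.text";

    // Writes a tiny ODT-like zip: a single "mimetype" entry holding the media type.
    static Path writeOdtLikeZip() {
        try {
            Path path = Files.createTempFile("shared-zip-demo", ".odt");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(path))) {
                out.putNextEntry(new ZipEntry("mimetype"));
                out.write(ODT.getBytes(StandardCharsets.US_ASCII));
                out.closeEntry();
            }
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mimics an OOXML-style check. When it finds nothing it optionally closes
    // the shared zip, which is the behaviour the report describes.
    static String opcLikeDetect(ZipFile zip, boolean closeOnFailure) throws IOException {
        if (zip.getEntry("[Content_Types].xml") != null) {
            return "application/x-demo-ooxml"; // placeholder type for this sketch
        }
        if (closeOnFailure) {
            zip.close();
        }
        return null;
    }

    // Mimics an OpenDocument-style check that reads the "mimetype" entry.
    static String odfLikeDetect(ZipFile zip) throws IOException {
        try {
            ZipEntry entry = zip.getEntry("mimetype");
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.US_ASCII);
            }
        } catch (IllegalStateException zipAlreadyClosed) {
            // A closed ZipFile rejects lookups, so this detector silently finds nothing.
            return null;
        }
    }

    // Runs both detectors against one shared ZipFile, falling back to application/zip.
    static String detect(Path path, boolean closeOnFailure) {
        try (ZipFile zip = new ZipFile(path.toFile())) {
            String type = opcLikeDetect(zip, closeOnFailure);
            if (type == null) {
                type = odfLikeDetect(zip);
            }
            return type != null ? type : "application/zip";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = writeOdtLikeZip();
        System.out.println("first detector closes zip: " + detect(path, true));
        System.out.println("first detector leaves zip: " + detect(path, false));
        Files.deleteIfExists(path);
    }
}
```

With closeOnFailure set to true the second detector never sees the "mimetype" entry and the generic fallback wins; with false the same file is identified as ODT.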
After all detectors have effectively no-oped, [the DefaultZipContainerDetector
falls back to
application/zip|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L209].
Now, when running with the default CompositeDetector, the next detector is
usually the MimeTypes detector. This returns the proper
application/vnd.oasis.opendocument.text, but the [CompositeDetector will
ignore|https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java#L86]
it as that mime type isn't marked up as a subclass of application/zip in [the
registry|https://github.com/apache/tika/blob/2.1.0-rc2/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L2327].
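
To illustrate why the later, more precise answer gets dropped, here is a simplified model of a "keep the result only if it specializes the current one" rule. It is a sketch under assumptions, not the actual CompositeDetector or MediaTypeRegistry code: the class SpecializationDemo, the parents map and both methods are hypothetical, and the real registry has more rules than a plain parent chain.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of combining detector results; not Tika's implementation.
final class SpecializationDemo {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String ODT = "application/vnd.oasis.opendocument.text";

    // True if `ancestor` appears in the parent chain of `type`.
    // Every other type is treated as a specialization of application/octet-stream.
    static boolean isSpecializationOf(Map<String, String> parents, String type, String ancestor) {
        if (ancestor.equals(OCTET_STREAM) && !type.equals(OCTET_STREAM)) {
            return true;
        }
        String current = parents.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    // Walks the detector results in order, keeping a result only when it
    // refines the type chosen so far.
    static String combine(Map<String, String> parents, List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(parents, detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // Zip detector answers application/zip first, a later detector answers ODT.
        List<String> results = List.of(ZIP, ODT);

        // Registry without the subclass relation: the ODT answer is ignored.
        System.out.println(combine(Map.of(), results));

        // Registry where ODT is declared a subclass of zip: the ODT answer wins.
        System.out.println(combine(Map.of(ODT, ZIP), results));
    }
}
```

In this model the order of detectors matters only because the first answer sets the baseline that every later answer has to refine.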
In short, I think there are potentially two bugs here:
# The OPCPackageDetector either shouldn't close the zip while detecting, or
the DefaultZipContainerDetector should re-open it if necessary?
# The registry should be updated to mark up
application/vnd.oasis.opendocument.text as a subclass of application/zip?