[ 
https://issues.apache.org/jira/browse/TIKA-3556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415531#comment-17415531
 ] 

Tim Allison commented on TIKA-3556:
-----------------------------------

Wait, no, I was testing a bad odt file.  

I agree with your point about the bugs.  I'm trying to write a unit test to 
test for this, and I'm seeing the zip detectors (in a unit test in 
tika-parsers-standard-package) in this order:

IWorkDetector
IPADetector
JarDetector
KMZDetector
OpenDocumentDetector
StarOfficeDetector
OPCPackageDetector

When run in this order, the OpenDocumentDetector correctly identifies the file 
type, and the ones after it are not called.

To confirm, your detectors are in a different order (with OPCPackageDetector 
coming before OpenDocumentDetector)?

I'm wondering whether, in addition to fixing the other bugs, we should sort 
them in the above order whenever they're available?

I can write a unit test that uses a custom config to specify the order of the 
zip detectors, but can you share more info about what's on your classpath/which 
packages you're using/how you're building your detector?

Thank you, again.

> DefaultZipContainerDetector returns application/zip for .odt files when 
> OPCPackageDetector is on the classpath
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3556
>                 URL: https://issues.apache.org/jira/browse/TIKA-3556
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.1.0
>            Reporter: Simon Gaeremynck
>            Priority: Major
>
> This is happening because the OPCPackageDetector.detect method will [fail and 
> close the underlying zip 
> stream|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java#L257].
>  When the next detector runs (e.g. OpenDocumentDetector), the stream it 
> receives has been closed and it won't be able to detect anything.
> After all detectors have effectively no-oped, [the 
> DefaultZipContainerDetector falls back to 
> application/zip|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L209].
> Now, when running with the default CompositeDetector, the next detector is 
> usually the MimeTypes detector. This returns the proper 
> application/vnd.oasis.opendocument.text, but the [CompositeDetector will 
> ignore|https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java#L86]
>  it as that mime type isn't marked up as a subclass of application/zip in 
> [the 
> registry|https://github.com/apache/tika/blob/2.1.0-rc2/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L2327].
>  
> In short, I think there are two bugs here potentially:
>  # The OPCPackageDetector either shouldn't close the zip while detecting or 
> the DefaultZipContainerDetector should re-open if necessary?
>  # The registry should be updated to mark up 
> application/vnd.oasis.opendocument.text as a subclass of application/zip ?
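
To make the order dependence described above concrete, here is a minimal, self-contained sketch. It does not use any Tika classes: SharedZip, opcLike, odfLike and detectFirstMatch are hypothetical stand-ins, and the entry-name checks are simplified assumptions rather than the real detection logic. It only illustrates how a detector that closes a shared zip on a miss can leave later detectors with nothing to read, so the chain falls back to application/zip.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Function;

public class ZipDetectorOrderSketch {

    /** Hypothetical stand-in for an opened zip container shared by all detectors. */
    static final class SharedZip {
        private final List<String> entryNames;
        private boolean closed = false;

        SharedZip(List<String> entryNames) {
            this.entryNames = entryNames;
        }

        List<String> entryNames() throws IOException {
            if (closed) {
                throw new IOException("zip already closed");
            }
            return entryNames;
        }

        void close() {
            closed = true;
        }
    }

    /** Simulates the reported behavior: on a miss, the shared zip gets closed. */
    static String opcLike(SharedZip zip) {
        try {
            if (zip.entryNames().contains("[Content_Types].xml")) {
                return "application/x-tika-ooxml";
            }
        } catch (IOException e) {
            return null; // nothing to inspect
        }
        zip.close(); // the side effect under discussion
        return null;
    }

    /** Simplified ODF check (assumption: entry names alone are enough here). */
    static String odfLike(SharedZip zip) {
        try {
            List<String> names = zip.entryNames();
            if (names.contains("mimetype") && names.contains("content.xml")) {
                return "application/vnd.oasis.opendocument.text";
            }
        } catch (IOException e) {
            return null; // zip was closed by an earlier detector: effectively a no-op
        }
        return null;
    }

    /** First non-null answer wins; otherwise fall back to plain zip. */
    static String detectFirstMatch(List<Function<SharedZip, String>> detectors, SharedZip zip) {
        for (Function<SharedZip, String> detector : detectors) {
            String type = detector.apply(zip);
            if (type != null) {
                return type;
            }
        }
        return "application/zip";
    }

    /** Runs the chain over a fake .odt with the OPC-like detector first or last. */
    static String run(boolean opcFirst) {
        SharedZip odt = new SharedZip(List.of("mimetype", "content.xml"));
        Function<SharedZip, String> opc = ZipDetectorOrderSketch::opcLike;
        Function<SharedZip, String> odf = ZipDetectorOrderSketch::odfLike;
        return detectFirstMatch(opcFirst ? List.of(opc, odf) : List.of(odf, opc), odt);
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints application/vnd.oasis.opendocument.text
        System.out.println(run(true));  // prints application/zip
    }
}
```

In this sketch, putting the ODF-like detector first hides the problem, because the chain stops at the first match. Removing the close() call in opcLike makes the result independent of order, which is why sorting the detectors and fixing the close are complementary rather than alternatives.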


