[
https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Burch resolved TIKA-1292.
------------------------------
Resolution: Fixed
Fix Version/s: 1.6
> Inconsistent priorities in bundled tika-mimetypes.xml
> -----------------------------------------------------
>
> Key: TIKA-1292
> URL: https://issues.apache.org/jira/browse/TIKA-1292
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.5
> Reporter: Cservenak, Tamas
> Fix For: 1.6
>
>
> The MIME type priorities in the tika-mimetypes.xml bundled with tika-core
> seem to be inconsistent.
> A few examples:
> *
> [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
> vs
> [application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]:
> both are similar "container" archive formats (structured, with entries)
> that have distinct file extensions ("zip" vs "7z" globs), yet their
> priorities are 40 and 50 respectively.
> *
> [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
> vs
> [text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]:
> these are unrelated MIME types that share the same priority of 40. But ZIP
> files can be "uncompressed" (meaning entries are mostly "concatenated", and
> their content, if plaintext, is readable). Hence, an "uncompressed" ZIP (or
> any subclass like JAR) file that contains HTML files might be detected as
> HTML, which is wrong.
> This is what happens in Nexus, which uses Tika under the hood for "content"
> validation, basically using the MIME magic detection provided by the Tika
> Detector: the Java JAR {{com.intellij:annotations:7.0.3}}
> ([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is
> being detected as {{text/html}} instead of the expected
> {{application/java-archive}}.
> The reason is as follows: the JAR file is packed in "uncompressed" ZIP
> format, and among a few annotations it also contains one HTML file entry
> (the license, I guess). Since both MIME types have the same priority (40),
> I guess Tika "randomly" chooses {{text/html}}.
> Original Nexus issue
> https://issues.sonatype.org/browse/NEXUS-6560
> The Nexus issue has a GH Pull Request that solves the problem for us (by
> raising the {{application/zip}} priority to 41).
> But by inspecting the bundled tika-mimetypes.xml we spotted other probable
> priority inconsistencies, like that of zip vs 7z mentioned above.
> Note: this happens when only tika-core is on the classpath and it is used
> for MIME magic detection. Interestingly, when tika-parsers (with all its
> dependencies) is added to the classpath, Tika properly figures out that the
> artifact is {{application/java-archive}}. Still, our use case in Nexus
> requires MIME magic detection only, so we do not use tika-parsers, nor
> would we like to.
> Sample project to reproduce
> https://github.com/cstamas/tika-1292
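
The "uncompressed" ZIP behaviour described in the report can be illustrated with a minimal sketch that uses only the JDK's java.util.zip classes and does not call Tika at all. The class name StoredZipDemo, the entry name license.html, and the helper methods are hypothetical names chosen for illustration, and the plain substring search is only a stand-in for content sniffing, not Tika's actual magic rules. It shows that a ZIP written with the STORED method still begins with the ZIP local-header signature while also exposing the HTML markup of its entry verbatim in the raw bytes, so both kinds of magic pattern can match the same file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class StoredZipDemo {

    /** Builds an in-memory ZIP with a single entry, using the given method. */
    static byte[] buildZip(String name, byte[] content, int method) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            ZipEntry entry = new ZipEntry(name);
            entry.setMethod(method);
            if (method == ZipEntry.STORED) {
                // STORED entries need size and CRC-32 set before they are written.
                CRC32 crc = new CRC32();
                crc.update(content);
                entry.setSize(content.length);
                entry.setCompressedSize(content.length);
                entry.setCrc(crc.getValue());
            }
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    /** True if the data starts with the ZIP local file header signature "PK\003\004". */
    static boolean startsWithZipMagic(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /** Naive stand-in for content sniffing: is the ASCII text present in the raw bytes? */
    static boolean containsAscii(byte[] data, String needle) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>License text</body></html>"
                .getBytes(StandardCharsets.US_ASCII);

        byte[] stored = buildZip("license.html", html, ZipEntry.STORED);
        byte[] deflated = buildZip("license.html", html, ZipEntry.DEFLATED);

        System.out.println("stored:   zip magic=" + startsWithZipMagic(stored)
                + ", raw <html> visible=" + containsAscii(stored, "<html>"));
        System.out.println("deflated: zip magic=" + startsWithZipMagic(deflated)
                + ", raw <html> visible=" + containsAscii(deflated, "<html>"));
    }
}
```

When both patterns match and the two types carry the same priority, the detector has no priority-based reason to prefer one over the other, which is consistent with the reporter's observation and with the proposed fix of raising the application/zip priority.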