[
https://issues.apache.org/jira/browse/TIKA-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444644#comment-17444644
]
Nick Burch commented on TIKA-3590:
----------------------------------
[~salmira] Could you create a few sample DMG files for us to test with?
Ideally with our standard set of contents for compressed / package formats, e.g.
test-documents.zip or test-documents.tar from
[https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/test/resources/test-documents]
[~tallison] I don't think our current structure allows one mime type to have
multiple parents, as needed here, where one "format" is actually many
different formats all sharing the same official mime type and extension. We
could potentially do something nasty and have subtypes which reproduce the
zlib and bzip2 magics along with some sort of compressed DMG header, but it'd
be tricky. We can't just use the standard zlib or bzip2 magic, otherwise it'll
trump the real formats, so it would need to be the compressed outer layer
magic _plus_ some sort of inner magic too. Unless we did something horribly
evil and had {{application/x-apple-diskimage; compression=zlib}}, which
defined a parent of zlib and not officially
{{application/x-apple-diskimage}}. I'm not sure the parent and parameter
matching would let us get away with that. Double magic would be
safer/cleaner, but we do need some test files to check it all with!
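To make the parent problem concrete, here is a rough toy model of the single-parent check. It is only a sketch: {{DmgHintSketch}}, {{PARENT}}, {{supertypeOf}} and the string-based {{applyHint}} are hypothetical names invented for illustration, not Tika's {{MediaTypeRegistry}} or {{MimeTypes}} code, and it assumes a filename hint only wins when it is a specialization of the magic-detected type.

```java
import java.util.HashMap;
import java.util.Map;

public class DmgHintSketch {
    // Toy single-parent registry (child type -> parent type). Hypothetical, not Tika's registry.
    static final Map<String, String> PARENT = new HashMap<>();
    static final String OCTET_STREAM = "application/octet-stream";

    // Types without an explicit parent fall back to octet-stream, the root of the tree.
    static String supertypeOf(String type) {
        if (type.equals(OCTET_STREAM)) {
            return null;
        }
        return PARENT.getOrDefault(type, OCTET_STREAM);
    }

    // True if b is reached by walking up the parent chain of a.
    static boolean isSpecializationOf(String a, String b) {
        for (String t = supertypeOf(a); t != null; t = supertypeOf(t)) {
            if (t.equals(b)) {
                return true;
            }
        }
        return false;
    }

    // Assumed rule: the filename hint replaces the magic result only if it specializes it.
    static String applyHint(String magicType, String hint) {
        return isSpecializationOf(hint, magicType) ? hint : magicType;
    }

    public static void main(String[] args) {
        String zlib = "application/zlib";
        String dmg = "application/x-apple-diskimage";
        String dmgZlib = dmg + "; compression=zlib";

        // DMG has no zlib parent, so the zlib magic result is kept.
        System.out.println(applyHint(zlib, dmg));

        // The "horribly evil" variant: a parameterised type whose parent is zlib.
        PARENT.put(dmgZlib, zlib);
        System.out.println(applyHint(zlib, dmgZlib));
    }
}
```

In this model the plain DMG hint loses to the zlib magic, while the parameterised type wins because its one parent is zlib. A bzip2-compressed DMG would need yet another parameterised type, which is the one-type-many-parents limitation again.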
> OSX DMG files wrong MIME type detection (wrong MediaType and Supertype)
> -----------------------------------------------------------------------
>
> Key: TIKA-3590
> URL: https://issues.apache.org/jira/browse/TIKA-3590
> Project: Tika
> Issue Type: Bug
> Components: core, detector
> Affects Versions: 1.26, 1.27, 2.0.0-ALPHA, 2.0.0-BETA, 2.1.0
> Reporter: Tetiana Tvardovska
> Priority: Major
>
> Calling {{mimeSupport.detectMimeTypes}} for OSX DMG files returns a wrong
> value.
> DMG files are detected as MIME type {{"application/zlib"}} or
> {{"application/x-bzip"}}
> instead of the expected {{"application/x-apple-diskimage"}}.
>
> The error is caused by the {{getSupertype}} method, which returns a type that
> is too "super" ({{MediaType.OCTET_STREAM}}) for OSX DMG files instead of
> {{"application/zlib"}} or {{"application/x-bzip"}}.
>
> For information, DMG mime type is correctly detected when debugging the
> method
>
> {code:java}
> org/apache/tika/mime/MimeTypes.java:484 public MediaType detect(...
> 522: MimeType hint = getMimeType(name);
> {code}
> the {{hint}} value gets a correct *{{"application/x-apple-diskimage"}}*
> value here.
> But later the {{hint}} value is dropped, because the {{applyHint}} result
> assigned to {{possibleTypes}} does not include it:
>
> {code:java}
> 529: possibleTypes = applyHint(possibleTypes, hint);{code}
>
> This wrong value is then returned to:
>
> {code:java}
> repository/org/apache/tika/tika-core/1.26/tika-core-1.26-sources.jar!/org/apache/tika/detect/CompositeDetector.java:84
> MediaType detected = detector.detect(input, metadata);
> if (registry.isSpecializationOf(detected, type)) {
>     type = detected;
> }
> {code}
>
>
> h3. Possible solution - Add a more precise Supertype detection for the
> {{application/x-apple-diskimage}} type
> Just add one more check to the {{MediaTypeRegistry.getSupertype}} method,
> for example, in a 'diff'-like format:
> {{org/apache/tika/tika-core/1.26/tika-core-1.26-sources.jar}}
> {{org/apache/tika/mime/MediaTypeRegistry.java:187}}
>
> {code:java}
> public MediaType getSupertype(MediaType type) {
>     ...
> +   } else if (type.getSubtype().endsWith("x-apple-diskimage")) {
> +       return MediaType.application("x-bzip");
> +   }
>     ...
> }
> {code}
>
> or
> {code:java}
> public MediaType getSupertype(MediaType type) {
>     ...
> +   } else if (type.getSubtype().endsWith("x-apple-diskimage")) {
> +       return MediaType.APPLICATION_ZIP;
> +   }
>     ...
> }
> {code}
>
>
> ---
> Tested with [Sonatype Nexus|https://github.com/sonatype/nexus-public/]
> {{release-3.36.0-01}}, on a RAW repository with "Strict Content Type
> Validation" set ON, when trying to upload *.dmg files.