[ 
https://issues.apache.org/jira/browse/TIKA-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444644#comment-17444644
 ] 

Nick Burch commented on TIKA-3590:
----------------------------------

[~salmira] Are you able to create a few sample DMG files for us to test with? 
Ideally they would hold our standard set of contents for compressed / package 
formats, e.g. test-documents.zip or test-documents.tar from 
[https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/test/resources/test-documents]

[~tallison] I don't think our current structure allows one mime type to have 
multiple parents, as would be needed here, where one "format" is actually 
several different formats that all share the same official mime type and 
extension. We could potentially do something nasty, and have subtypes which 
reproduce the zlib and bzip2 magics along with some sort of compressed DMG 
header, but it'd be tricky. We can't just use the standard zlib or bzip2 
magic, otherwise it'll trump the real formats, so it would need to be the 
compressed outer layer magic _plus_ some sort of inner magic too. 
Alternatively, we could do something horribly evil, and define 
{{application/x-apple-diskimage; compression=zlib}} with a parent of zlib 
rather than, officially, {{application/x-apple-diskimage}}. I'm not sure if 
the parent and parameter matching would let us get away with that. Double 
magic would be safer/cleaner, but we do need some test files to check it all with!
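
As a rough illustration of the "double magic" idea, here is a self-contained Java sketch (not Tika code; Tika's real magics are declared in tika-mimetypes.xml). It only reports a DMG when an outer compression magic and an inner marker both match, so plain zlib and bzip2 streams keep their own types. The "BZh" and 0x78 prefixes are the usual bzip2 and zlib ones. The inner marker bytes, its offset, and the class and method names are hypothetical placeholders, since the real inner header still has to be worked out from test files.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DoubleMagicSketch {
    // Usual bzip2 stream prefix: "BZh".
    static final byte[] BZIP2_MAGIC = {'B', 'Z', 'h'};
    // HYPOTHETICAL inner marker and offset, placeholders for illustration only.
    static final byte[] INNER_MARKER = "DMG?".getBytes(StandardCharsets.US_ASCII);
    static final int INNER_OFFSET = 16;

    // True if data contains magic starting at the given offset.
    static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (offset < 0 || data.length < offset + magic.length) {
            return false;
        }
        byte[] slice = Arrays.copyOfRange(data, offset, offset + magic.length);
        return Arrays.equals(slice, magic);
    }

    // zlib streams usually start with 0x78 followed by 0x01, 0x9C or 0xDA.
    static boolean looksLikeZlib(byte[] data) {
        if (data.length < 2 || data[0] != 0x78) {
            return false;
        }
        int flags = data[1] & 0xFF;
        return flags == 0x01 || flags == 0x9C || flags == 0xDA;
    }

    // Outer magic alone yields the compression type; outer plus inner yields DMG.
    static String detect(byte[] data) {
        boolean zlib = looksLikeZlib(data);
        boolean bzip2 = matchesAt(data, 0, BZIP2_MAGIC);
        boolean inner = matchesAt(data, INNER_OFFSET, INNER_MARKER);
        if ((zlib || bzip2) && inner) {
            return "application/x-apple-diskimage";
        }
        if (zlib) {
            return "application/zlib";
        }
        if (bzip2) {
            return "application/x-bzip";
        }
        return "application/octet-stream";
    }
}
```

Because the DMG branch requires both matches, a real zlib or bzip2 file is never claimed by the DMG rule, which is the property the outer-only magic would lack.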

> OSX DMG files wrong MIME type detection (wrong MediaType and Supertype)
> -----------------------------------------------------------------------
>
>                 Key: TIKA-3590
>                 URL: https://issues.apache.org/jira/browse/TIKA-3590
>             Project: Tika
>          Issue Type: Bug
>          Components: core, detector
>    Affects Versions: 1.26, 1.27, 2.0.0-ALPHA, 2.0.0-BETA, 2.1.0
>            Reporter: Tetiana Tvardovska
>            Priority: Major
>
> Calling {{mimeSupport.detectMimeTypes}} for OSX DMG files returns a wrong 
> value.
> DMG files are detected as MIME type *{{"application/zlib"}}* or 
> *{{"application/x-bzip"}}*
> instead of the expected *{{"application/x-apple-diskimage"}}*.
>  
> The error is caused by the {{getSupertype}} method, which returns a wrong 
> type (too "super": {{MediaType.OCTET_STREAM}}) for OSX DMG files instead of 
> *{{"application/zlib"}}* or *{{"application/x-bzip"}}*.
>  
> For information, the DMG MIME type is correctly detected when debugging the 
> method
>  
> {code:java}
> org/apache/tika/mime/MimeTypes.java:484  public MediaType detect(...
> 522:  MimeType hint = getMimeType(name); 
> {code}
> Here the {{hint}} variable gets the correct 
> *{{"application/x-apple-diskimage"}}* value.
> But later the {{hint}} value is not taken into account in the 
> {{possibleTypes}} that {{applyHint}} returns:
>  
> {code:java}
> 529:  possibleTypes = applyHint(possibleTypes, hint);{code}
>  
> This wrong value is then returned to:
>  
> {code:java}
> // repository/org/apache/tika/tika-core/1.26/tika-core-1.26-sources.jar!/org/apache/tika/detect/CompositeDetector.java:84
> MediaType detected = detector.detect(input, metadata);
> if (registry.isSpecializationOf(detected, type)) {
>     type = detected;
> }
> {code}
>  
>  
> h3. Possible solution - add a more precise supertype detection for the "{{application/x-apple-diskimage}}" type
> Just add one more check to the 
> {{MediaTypeRegistry.getSupertype}} method, for example (in a 
> 'diff'-like format):
> {{org/apache/tika/tika-core/1.26/tika-core-1.26-sources.jar}}
> {{org/apache/tika/mime/MediaTypeRegistry.java:187}}
>  
> {code:java}
> public MediaType getSupertype(MediaType type) {
>  ...
> +    } else if (type.getSubtype().endsWith("x-apple-diskimage")) { 
> +        return MediaType.application("x-bzip");
> +    }
> ...
> }
> {code}
>  
> or
> {code:java}
> public MediaType getSupertype(MediaType type) {
>  ...
> +    } else if (type.getSubtype().endsWith("x-apple-diskimage")) { 
> +        return MediaType.APPLICATION_ZIP;
> +    }
> ...
> }
> {code}
>  
>  
> ---
> Tested with the [Sonatype Nexus|https://github.com/sonatype/nexus-public/] 
> project, {{release-3.36.0-01}}, on a RAW repository with "Strict Content Type 
> Validation" set ON, when trying to upload *.dmg files.
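
To make the reported behaviour concrete, here is a minimal, self-contained model of the specialization check from the quoted {{CompositeDetector}} snippet. It is not Tika's {{MediaTypeRegistry}}: the class {{ToyRegistry}} and its methods are made up for illustration. The sketch assumes that each type has exactly one parent, that types without a declared parent fall back to {{application/octet-stream}}, and that the combining loop starts from {{application/octet-stream}}.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a media type registry; not the Tika class.
class ToyRegistry {
    static final String OCTET_STREAM = "application/octet-stream";

    // Single-parent hierarchy: each type maps to at most one supertype.
    private final Map<String, String> parents = new HashMap<>();

    void setParent(String type, String parent) {
        parents.put(type, parent);
    }

    // Types without a declared parent fall back to octet-stream,
    // and octet-stream itself has no supertype.
    String getSupertype(String type) {
        if (type.equals(OCTET_STREAM)) {
            return null;
        }
        return parents.getOrDefault(type, OCTET_STREAM);
    }

    // True if b is reachable from a by following supertype links.
    boolean isSpecializationOf(String a, String b) {
        for (String t = getSupertype(a); t != null; t = getSupertype(t)) {
            if (t.equals(b)) {
                return true;
            }
        }
        return false;
    }

    // Same shape as the quoted loop: a later result only replaces the
    // current type when it is a specialization of it.
    String combine(String... detected) {
        String type = OCTET_STREAM;
        for (String d : detected) {
            if (isSpecializationOf(d, type)) {
                type = d;
            }
        }
        return type;
    }
}
```

In this model, combining a content-based {{application/zlib}} with a name-based {{application/x-apple-diskimage}} keeps {{application/zlib}} until the disk image type is given zlib as its parent. With that single parent in place, a DMG whose content is detected as {{application/x-bzip}} is still not upgraded, which is the one-parent limitation raised in the comment above.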



