[ https://issues.apache.org/jira/browse/NUTCH-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898303#comment-17898303 ]
Tim Allison commented on NUTCH-3089:
------------------------------------

Thank you [~snagel]! If {{mime.type.magic}} doesn't already exist, I'm wondering if we can change it to something like {{mime.type.detect}}. The (admittedly pedantic) difference is that Tika uses magic but also has container detection (if tika-parsers-standard is on the class path).

> Review MIME type detection
> --------------------------
>
>                 Key: NUTCH-3089
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3089
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, util
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.21
>
>
> The MIME detection in {{o.a.n.util.MimeUtil#autoResolveContentType}} needs a
> review:
> - the fall-back to use the Content-Type HTTP header, only moderately cleaned,
> leads to strange-looking and obviously misspelled resp. invalid MIME types:
> "application/.octet-stream", "application/."
> - note: this issue stems from a [discussion on the Common Crawl user
> group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ].
> More examples are given there.
> - Tika's method
> [MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)]
> used in [MimeUtil.java, line
> 162|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L162]
> does only a limited validation, not sufficient to filter out the above
> mentioned erroneous MIME types.
> - performance: if the property {{mime.type.magic}} == true, Tika's magic
> detector is called with the binary content and the URL (which includes the
> file suffix) and the Content-Type HTTP header as additional hints to support
> the detection. Tika's detect method uses similar fall-back heuristics,
> calling also {{MimeTypes#forName}}. Relying only on Tika's detect method if
> {{mime.type.magic}} == true, should save computation time, and eventually
> leads to more precise results.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
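
For illustration only, here is a minimal sketch of a stricter syntax check that would reject the header values quoted in the issue ("application/.octet-stream", "application/."). It assumes that "valid" means matching the restricted-name grammar of RFC 6838, section 4.2; the class name {{MimeSyntaxCheck}} and the method {{clean}} are hypothetical and are not part of Nutch or Tika.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not Nutch or Tika code.
public class MimeSyntaxCheck {

    // Assumption: RFC 6838, section 4.2 restricted-name. The first character
    // must be a letter or digit, followed by at most 126 characters from a
    // small set of letters, digits and symbols.
    private static final String RESTRICTED_NAME =
        "[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}";

    private static final Pattern TYPE_SUBTYPE =
        Pattern.compile(RESTRICTED_NAME + "/" + RESTRICTED_NAME);

    /** Returns the lower-cased "type/subtype" if it is syntactically valid, else empty. */
    public static Optional<String> clean(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return Optional.empty();
        }
        // Drop parameters such as "; charset=UTF-8" before validating.
        int semi = contentTypeHeader.indexOf(';');
        String base = (semi >= 0
            ? contentTypeHeader.substring(0, semi)
            : contentTypeHeader).trim();
        if (!TYPE_SUBTYPE.matcher(base).matches()) {
            return Optional.empty();
        }
        return Optional.of(base.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] headers = {
            "text/html; charset=UTF-8",
            "application/.octet-stream",
            "application/."
        };
        for (String h : headers) {
            System.out.println(h + " -> " + clean(h).orElse("(rejected)"));
        }
    }
}
```

Both erroneous values fail because their subtype starts with ".", which the grammar does not allow as a first character. A check of this kind only covers syntax; whether the type is registered or matches the content is a separate question that the detector would still need to answer.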