Sebastian Nagel created NUTCH-3089:
--------------------------------------
Summary: Review MIME type detection
Key: NUTCH-3089
URL: https://issues.apache.org/jira/browse/NUTCH-3089
Project: Nutch
Issue Type: Improvement
Components: protocol, util
Affects Versions: 1.20
Reporter: Sebastian Nagel
Fix For: 1.21
The MIME detection in {{o.a.n.util.MimeUtil#autoResolveContentType}} needs a
review:
- the fall-back to the Content-Type HTTP header, which is only moderately
cleaned, leads to strange-looking and obviously misspelled or invalid MIME
types, e.g. "application/.octet-stream" or "application/."
- note: this issue stems from a [discussion on the Common Crawl user
group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ].
More examples are given there.
- Tika's method
[MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)]
used in [MimeUtil.java, line
162|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L162]
performs only limited validation, which is not sufficient to filter out the
erroneous MIME types mentioned above.
- performance: if the property {{mime.type.magic}} == true, Tika's magic
detector is called with the binary content and the URL (which includes the file
suffix) and the Content-Type HTTP header as additional hints to support the
detection. Tika's detect method uses similar fall-back heuristics, also calling
{{MimeTypes#forName}}. Relying only on Tika's detect method if
{{mime.type.magic}} == true should save computation time and may lead
to more precise results.
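
To illustrate the kind of stricter validation the first two points call for,
here is a minimal sketch, not the actual Nutch or Tika code. It assumes that
checking the type and subtype against the "restricted-name" grammar of RFC 6838
(section 4.2) is an acceptable filter. The class {{MimeTypeSyntax}} and the
method {{normalize}} are hypothetical names introduced only for this example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch or Tika. */
final class MimeTypeSyntax {

  // RFC 6838 restricted-name: first character is a letter or digit,
  // followed by at most 126 characters from a restricted set.
  private static final String NAME =
      "[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}";

  private static final Pattern TYPE = Pattern.compile(NAME + "/" + NAME);

  /**
   * Strips parameters (everything from the first ';') and returns the
   * lower-cased "type/subtype", or null if the value is syntactically
   * invalid and should not be used as a fall-back.
   */
  static String normalize(String contentType) {
    if (contentType == null) {
      return null;
    }
    int semicolon = contentType.indexOf(';');
    String base = (semicolon >= 0
        ? contentType.substring(0, semicolon)
        : contentType).trim();
    return TYPE.matcher(base).matches()
        ? base.toLowerCase(Locale.ROOT)
        : null;
  }

  public static void main(String[] args) {
    // Both examples from this issue are rejected because the subtype
    // starts with '.', which is not a letter or digit.
    System.out.println(normalize("application/.octet-stream"));
    System.out.println(normalize("application/."));
    // A well-formed header value is reduced to its base type.
    System.out.println(normalize("Text/HTML; charset=UTF-8"));
  }
}
```

A syntactic check like this only rejects malformed values. It does not verify
that a type is registered or that it matches the content, so it could at most
complement the detection, not replace it.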