[ 
https://issues.apache.org/jira/browse/NUTCH-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898712#comment-17898712
 ] 

Sebastian Nagel commented on NUTCH-3089:
----------------------------------------

[~hiranchaudhuri] Thanks! This is actually a good idea. Yes, it requires some 
implementation work and some significant changes. But it might be worth it. 
[~tallison], moving the MIME detection to a plugin would reduce the burden of 
its huge dependency tree. I've opened NUTCH-3090 to discuss this idea. 
Let's keep this issue focused on how the MIME detection is implemented in Nutch 
and how the Tika methods are called. 

> Review MIME type detection
> --------------------------
>
>                 Key: NUTCH-3089
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3089
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, util
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.21
>
>
> The MIME detection in {{o.a.n.util.MimeUtil#autoResolveContentType}} needs a 
> review:
> - the fallback to the Content-Type HTTP header, which is only moderately 
> cleaned, leads to strange-looking and obviously misspelled or invalid MIME 
> types: "application/.octet-stream", "application/."
>   - note: this issue stems from a [discussion on the Common Crawl user 
> group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ]. 
> More examples are given there.
>   - Tika's method 
> [MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)]
>  used in [MimeUtil.java, line 
> 162|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L162]
>  performs only limited validation, which is not sufficient to filter out the 
> erroneous MIME types mentioned above.
> - performance: if the property {{mime.type.magic}} == true, Tika's magic 
> detector is called with the binary content, and with the URL (which includes 
> the file suffix) and the Content-Type HTTP header as additional hints to 
> support the detection. Tika's detect method uses similar fallback heuristics 
> and also calls {{MimeTypes#forName}}. Relying only on Tika's detect method if 
> {{mime.type.magic}} == true should save computation time and may lead to 
> more precise results.
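
To illustrate the validation gap described above, here is a minimal sketch of a 
stricter syntax check for a MIME type taken from a Content-Type header. It is 
not what Nutch or Tika currently does. The class {{MimeTypeSyntax}} and its 
methods are hypothetical names, and the rule applied is an assumption: the 
"restricted-name" grammar from RFC 6838, section 4.2, under which a type or 
subtype must start with a letter or digit. Both examples from the description 
fail that rule because the subtype starts with a dot.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch or Tika. */
final class MimeTypeSyntax {

  // Assumed rule (RFC 6838, section 4.2): a name starts with a letter or
  // digit, followed by up to 126 letters, digits, or ! # $ & - ^ _ . +
  private static final String NAME = "[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}";
  private static final Pattern TYPE_SUBTYPE = Pattern.compile(NAME + "/" + NAME);

  private MimeTypeSyntax() {}

  /** Returns true if the string is a syntactically valid "type/subtype". */
  static boolean isValid(String mimeType) {
    return mimeType != null && TYPE_SUBTYPE.matcher(mimeType).matches();
  }

  /**
   * Strips parameters such as "; charset=utf-8" from a Content-Type header
   * value and returns the lower-cased type, or null if it is not valid.
   */
  static String cleanOrNull(String contentTypeHeader) {
    if (contentTypeHeader == null) {
      return null;
    }
    int semicolon = contentTypeHeader.indexOf(';');
    String type = semicolon < 0
        ? contentTypeHeader
        : contentTypeHeader.substring(0, semicolon);
    type = type.trim().toLowerCase(Locale.ROOT);
    return isValid(type) ? type : null;
  }

  public static void main(String[] args) {
    String[] samples = {
      "text/html; charset=UTF-8",
      "application/.octet-stream",
      "application/."
    };
    for (String s : samples) {
      System.out.println(s + " -> " + cleanOrNull(s));
    }
  }
}
```

A check like this could run before the header value is used as a fallback, so 
that a rejected value is treated the same as a missing header. Whether to 
reject or to attempt a repair of such values is a design choice this sketch 
leaves open.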



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
