[
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1259:
---------------------------------
Attachment: NUTCH-1259-1.5-1.patch
Here's a patch for 1.5. Comments? We have this running in production and it
works very well. It completely solves the big problem of ending up with
many thousands of crap content-types.
I'll commit this one tomorrow unless there are objections.
> TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
> --------------------------------------------------------------------------
>
> Key: NUTCH-1259
> URL: https://issues.apache.org/jira/browse/NUTCH-1259
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
> Attachments: NUTCH-1259-1.5-1.patch
>
>
> The MIME type detected by Tika's detect() API is never added to a Parse's
> ContentMetaData or ParseMetaData. Because of this, bad Content-Types end
> up in the documents.
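
To illustrate the idea in the description, here is a minimal sketch in plain Java. It is not the attached patch: it uses ordinary maps instead of Nutch's metadata classes and takes the detected type as a string instead of calling Tika. The names ContentTypeSketch and withDetectedType are hypothetical, for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class ContentTypeSketch {
    // Header/metadata key used in this sketch (assumption, not a Nutch constant).
    public static final String CONTENT_TYPE = "Content-Type";

    /**
     * Returns a copy of the HTTP header metadata in which Content-Type is
     * replaced by the detected MIME type, when detection produced one.
     * If nothing was detected, the header value is left untouched.
     */
    public static Map<String, String> withDetectedType(
            Map<String, String> httpHeaders, String detectedType) {
        Map<String, String> parseMeta = new HashMap<>(httpHeaders);
        if (detectedType != null && !detectedType.isEmpty()) {
            parseMeta.put(CONTENT_TYPE, detectedType);
        }
        return parseMeta;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        // A malformed value as a server might send it.
        headers.put(CONTENT_TYPE, "text/html;; charset=foo");
        Map<String, String> meta = withDetectedType(headers, "text/html");
        System.out.println(meta.get(CONTENT_TYPE)); // prints text/html
    }
}
```

The point of the sketch is only the precedence rule: when a detected type is available, it wins over whatever the server sent, so malformed header values do not reach the indexed documents.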