[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204463#comment-13204463 ]
Julien Nioche commented on NUTCH-1259:
--------------------------------------
bq. // DO NOT ADD Content-Type FROM HTTP_HEADERS, ONLY ADD THE DETECTED TYPE
SEE https://issues.apache.org/jira/browse/NUTCH-1259
hmmm, isn't that the content-type from the HTML headers instead?
Anyway, it is probably a good idea NOT to add it to the parse metadata. The
type has already been detected from the content and stored in the content
metadata, and I can't think of a reason why we'd want to duplicate that in the
parse metadata as well. The value in the content metadata is the one set by the
detector and should be the correct one. Or am I missing something?
> TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
> --------------------------------------------------------------------------
>
> Key: NUTCH-1259
> URL: https://issues.apache.org/jira/browse/NUTCH-1259
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
> Attachments: NUTCH-1259-1.5-1.patch
>
>
> The MIME-type detected by Tika's Detect() API is never added to a Parse's
> ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end
> up in the documents.
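
As a rough illustration of the behaviour the quoted code comment describes (skip the Content-Type carried over from the headers, keep only the detected type), here is a minimal Java sketch. It uses plain maps rather than the Nutch or Tika classes; the class name ContentTypeSketch, the helper buildParseMetadata and the sample values are hypothetical and are not taken from the attached patch. Whether the detected type needs to be copied into the parse metadata at all is the open question raised in the comment above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the metadata handling discussed in NUTCH-1259.
// Plain maps are used here instead of Nutch's own metadata classes.
final class ContentTypeSketch {
    static final String CONTENT_TYPE = "Content-Type";

    // Copies every extracted entry except Content-Type, then records the
    // detected MIME type (if any) under Content-Type.
    static Map<String, String> buildParseMetadata(Map<String, String> extracted,
                                                  String detectedType) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            // Header names are case-insensitive, so compare accordingly.
            if (CONTENT_TYPE.equalsIgnoreCase(entry.getKey())) {
                continue; // do not carry over the header-supplied type
            }
            parseMeta.put(entry.getKey(), entry.getValue());
        }
        if (detectedType != null) {
            parseMeta.put(CONTENT_TYPE, detectedType);
        }
        return parseMeta;
    }

    public static void main(String[] args) {
        // Assumed example: the header claims HTML, detection says PDF.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("Content-Type", "text/html");
        extracted.put("title", "Annual report");

        Map<String, String> parseMeta = buildParseMetadata(extracted, "application/pdf");
        System.out.println(parseMeta);
    }
}
```

With this shape, a wrong type announced in the headers cannot reach the indexed document through the parse metadata, because only the detected value is ever written under Content-Type.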