[
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196843#comment-13196843
]
Markus Jelsma commented on NUTCH-1262:
--------------------------------------
I've looked through the API and source code, but what we need doesn't seem to
be there. It does provide APIs to return aliases, but, judging from the
tika-mimetypes.xml resource file, Tika does not consider the example types to
be aliases of each other.
This simple mapper also allows users to map content types to a human-friendly
alias.
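As a rough illustration of the idea only (this is not the code in the
attached patch), a mapper of this kind could be sketched as below. The class
name ContentTypeMapper, its methods, and the "web page" alias are all
hypothetical, and the sketch assumes that parameters such as "; charset=UTF-8"
should be ignored when looking up a type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps several source content types to one target type
// or human-friendly alias. Not the implementation from the NUTCH-1262 patch.
class ContentTypeMapper {
    // Key: normalized source content type. Value: target type or alias.
    private final Map<String, String> aliases = new HashMap<>();

    // Registers every given source type as mapping to the same target.
    void addAlias(String target, String... sources) {
        for (String source : sources) {
            aliases.put(normalize(source), target);
        }
    }

    // Returns the target for a known type; an unknown type is returned in
    // its normalized form (lower case, parameters stripped).
    String map(String contentType) {
        if (contentType == null) {
            return null;
        }
        String key = normalize(contentType);
        return aliases.getOrDefault(key, key);
    }

    // Assumption: drop anything after ';' and compare case-insensitively.
    private static String normalize(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With addAlias("web page", "text/html", "application/xhtml+xml") registered,
both text/html and application/xhtml+xml would be indexed under the same
value, so a single filter could select them.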
> Map `duplicating` content-types to a single type
> ------------------------------------------------
>
> Key: NUTCH-1262
> URL: https://issues.apache.org/jira/browse/NUTCH-1262
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
> Attachments: NUTCH-1262-1.5-1.patch
>
>
> Similar or duplicate content types can end up as different values in an
> index. With, for example, both application/xhtml+xml and text/html present,
> it is impossible to use a single filter to select `web pages`.
> See also:
> http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html