[ 
https://issues.apache.org/jira/browse/NUTCH-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1991:
-----------------------------------
    Attachment: NUTCH-1991-trunk.v2.patch

Thanks, [~ilopata1]! I have updated the patch to apply against trunk. Only the 
core change remains (use mimeTypes.detect() instead of tika.detect()). Tested: 
tika-mimetypes.xml is loaded from $NUTCH_HOME/conf/ if the property 
mime.types.file is set.
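
For anyone reproducing the test, the property could be set roughly as in the 
fragment below. This is only a sketch: placing the override in 
conf/nutch-site.xml and giving the file name relative to the conf directory 
are assumptions, not something taken from the patch.

```xml
<!-- Assumed location: $NUTCH_HOME/conf/nutch-site.xml -->
<property>
  <name>mime.types.file</name>
  <!-- Assumption: the name is resolved against $NUTCH_HOME/conf/ -->
  <value>tika-mimetypes.xml</value>
</property>
```

With the patch applied, content-based rules added to that file should then 
take part in detection.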

> Tika mime detection not using Nutch supplied tika-mimetypes.xml for content 
> based detection
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1991
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1991
>             Project: Nutch
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1, 1.10, 1.11, 2.3.1
>            Reporter: Iain Lopata
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-1991-1.6.patch, NUTCH-1991-trunk.v2.patch
>
>
> From Nutch version 1.5 onwards, the MimeUtil.java class, which acts as a 
> facade to Tika for MIME type detection, attempts a match using the MIME type 
> returned by the server, the filename and the content. NUTCH-1045 provided 
> for the use of an external tika-mimetypes.xml file to configure this 
> process. However, the content-based detection did not use this file and 
> instead fell back to the configuration bundled with the Tika library. 
> Consequently, any content-based match rules added to the Nutch version of 
> the configuration file were not used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
