[
https://issues.apache.org/jira/browse/NUTCH-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512691#comment-14512691
]
Chris A. Mattmann commented on NUTCH-1991:
------------------------------------------
The problem here is that tika.detect() uses the DefaultDetector, which is a
CompositeDetector strategy for resolving MimeTypes. Changing the call to
mimeTypes.detect would mean replicating everything that tika.detect already
does, which is not optimal. I suggest we simply keep the tika.detect call and,
if a custom mimeTypes file is provided, build a new Tika facade object with
the path to the overridden mimeTypes file. I'll work towards this shortly.
(BTW, TestZipParser fails with this patch, which is why the build broke: the
custom mime types file is null, so the call to mimeTypes.detect now returns
text/plain instead of application/x-zip.)
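
A toy model of the suggestion above may help make the trade-off concrete. It
is only a sketch: none of the types below are Tika or Nutch classes, and the
names (DetectSketch, RuleTable, facadeFor, the "#!example" rule) are
hypothetical. It assumes the facade keeps its chain of detectors and that only
the rule table (standing in for the mime types file) is swapped when a custom
one is supplied. The composite here takes the first concrete answer, which is
a simplification of how a real composite detector picks a result.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are Tika or Nutch classes.
final class DetectSketch {
    static final String UNKNOWN = "application/octet-stream";

    // Hypothetical stand-in for a detector; returns UNKNOWN when undecided.
    interface Detector {
        String detect(byte[] head);
    }

    // Magic-prefix rule table, standing in for a parsed mime types file.
    static final class RuleTable implements Detector {
        private final List<byte[]> prefixes = new ArrayList<>();
        private final List<String> types = new ArrayList<>();

        RuleTable rule(String type, String asciiPrefix) {
            prefixes.add(asciiPrefix.getBytes(StandardCharsets.US_ASCII));
            types.add(type);
            return this;
        }

        @Override
        public String detect(byte[] head) {
            for (int i = 0; i < prefixes.size(); i++) {
                byte[] p = prefixes.get(i);
                if (head.length >= p.length
                        && Arrays.equals(Arrays.copyOf(head, p.length), p)) {
                    return types.get(i);
                }
            }
            return UNKNOWN;
        }
    }

    // Built-in rules, standing in for the configuration bundled with the library.
    static final RuleTable DEFAULT_RULES = new RuleTable()
            .rule("application/zip", "PK")
            .rule("application/pdf", "%PDF-");

    // An extra detector in the chain: non-empty, printable ASCII counts as text.
    static final Detector TEXT = head -> {
        if (head.length == 0) {
            return UNKNOWN;
        }
        for (byte b : head) {
            boolean printable = (b >= 0x20 && b <= 0x7E)
                    || b == '\n' || b == '\r' || b == '\t';
            if (!printable) {
                return UNKNOWN;
            }
        }
        return "text/plain";
    };

    // Simplified composite: the first detector with a concrete answer wins.
    static Detector composite(Detector... chain) {
        return head -> {
            for (Detector d : chain) {
                String type = d.detect(head);
                if (!UNKNOWN.equals(type)) {
                    return type;
                }
            }
            return UNKNOWN;
        };
    }

    // Keep the composite chain; swap in the custom rule table only if one is given.
    static Detector facadeFor(RuleTable customRules) {
        RuleTable rules = (customRules != null) ? customRules : DEFAULT_RULES;
        return composite(rules, TEXT);
    }

    public static void main(String[] args) {
        RuleTable custom = new RuleTable()
                .rule("application/zip", "PK")
                .rule("application/x-example", "#!example");
        byte[] sample = "#!example payload".getBytes(StandardCharsets.US_ASCII);
        System.out.println("default facade: " + facadeFor(null).detect(sample));
        System.out.println("custom facade:  " + facadeFor(custom).detect(sample));
    }
}
```

In this model the default facade never sees the custom content rule, which
mirrors the behaviour reported in the issue, while calling the rule table
directly (the analogue of mimeTypes.detect) would skip the other detectors in
the chain.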
> Tika mime detection not using Nutch supplied tika-mimetypes.xml for content
> based detection
> -------------------------------------------------------------------------------------------
>
> Key: NUTCH-1991
> URL: https://issues.apache.org/jira/browse/NUTCH-1991
> Project: Nutch
> Issue Type: Bug
> Components: util
> Affects Versions: 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1, 1.10, 1.11, 2.3.1
> Reporter: Iain Lopata
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.10
>
> Attachments: NUTCH-1991-1.6.patch, NUTCH-1991-trunk.v2.patch
>
>
> From Nutch version 1.5 onwards, the MimeUtil.java class, which acts as a
> facade to Tika for mime type detection, attempts a match using the mimetype
> returned by the server, the filename, and the content. NUTCH-1045 provided
> for the use of an external tika-mimetypes.xml file, which supplies the
> configuration for this process. However, the content-based detection did not
> use this file and instead reverted to the configuration included in the Tika
> library. Consequently, any content-based match rules added to the Nutch
> version of the configuration file were not used.