[
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067773#comment-13067773
]
Julien Nioche commented on NUTCH-1045:
--------------------------------------
You should see a message in the logs at the beginning of a task saying
something like
{quote}
LOG.error("Can't load mime.types.file : tika-mimetypes.xml using Tika's default");
{quote}
and get the right number of mime-type counts (although I am not sure that
we are currently reporting these in 1.4).
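For illustration, the fallback behind that log message might look roughly like
the sketch below. This is not the MimeUtil code from the patch: the class
MimeConfigFallback, the method chooseSource and the DEFAULT_SOURCE marker are
hypothetical names, it uses java.util.logging rather than Nutch's logger, and
it only decides which configuration source to use instead of building a Tika
object. It also assumes the message is logged only when a file was configured
but could not be opened.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Logger;

/** Hypothetical sketch of the fallback decision; not the actual MimeUtil code. */
final class MimeConfigFallback {
    private static final Logger LOG =
        Logger.getLogger(MimeConfigFallback.class.getName());

    /** Marker meaning "use the mime-type definitions bundled with Tika". */
    static final String DEFAULT_SOURCE = "tika-default";

    /**
     * Returns the custom file name if it is set and can be opened from the
     * classpath; otherwise logs the problem (if a file was set) and returns
     * DEFAULT_SOURCE.
     */
    static String chooseSource(String customFile, ClassLoader loader) {
        if (customFile != null && !customFile.trim().isEmpty()) {
            // getResourceAsStream returns null when the resource is missing;
            // try-with-resources tolerates a null resource.
            try (InputStream in = loader.getResourceAsStream(customFile)) {
                if (in != null) {
                    return customFile;
                }
            } catch (IOException e) {
                // Closing the stream failed: treat the file as unloadable.
            }
            LOG.severe("Can't load mime.types.file : " + customFile
                + " using Tika's default");
        }
        return DEFAULT_SOURCE;
    }

    public static void main(String[] args) {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        // No custom file configured: silently use the default.
        System.out.println(chooseSource(null, loader));
        // Custom file configured but missing: log, then use the default.
        System.out.println(chooseSource("no-such-mimetypes.xml", loader));
    }
}
```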
The problem is that in most cases you'll get the mime-type guessed because of
the info returned by the server, not because of Tika's detection. The best way
of making sure that Tika successfully relies on the default setting for
guessing would be to add a LOG entry on line 175 in MimeUtil with the Mimetype
found.
BTW, this class is in serious need of refactoring, as the underlying Tika API
has changed a lot. The logic around which strategies to use (e.g. trust the
metadata returned by the server? trust Tika's detection?) should be
reimplemented using the Detector implementations. I will open a new JIRA for this.
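To make the question of strategy order concrete, here is a minimal
self-contained sketch. Everything in it is hypothetical: MimeGuesser is a
stand-in interface and does not have the signature of Tika's Detector,
MAGIC_BYTES is a toy replacement for Tika's detection that only knows two
signatures, and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch of ordered mime-type strategies; not Tika's API. */
final class MimeStrategySketch {

    /** Stand-in strategy interface (Tika's Detector looks different). */
    interface MimeGuesser {
        Optional<String> guess(String serverContentType, byte[] content);
    }

    /** Trust the Content-Type header from the server, minus parameters. */
    static final MimeGuesser SERVER_HEADER = (header, content) -> {
        if (header == null) {
            return Optional.empty();
        }
        int semi = header.indexOf(';');
        String base = (semi >= 0 ? header.substring(0, semi) : header).trim();
        if (base.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(base.toLowerCase(Locale.ROOT));
    };

    /** Toy content-based detection: recognises only PDF and GIF prefixes. */
    static final MimeGuesser MAGIC_BYTES = (header, content) -> {
        if (startsWith(content, "%PDF-")) {
            return Optional.of("application/pdf");
        }
        if (startsWith(content, "GIF8")) {
            return Optional.of("image/gif");
        }
        return Optional.empty();
    };

    static boolean startsWith(byte[] content, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (content == null || content.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (content[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the first answer in the given order, else a generic type. */
    static String resolve(List<MimeGuesser> order, String header, byte[] content) {
        for (MimeGuesser guesser : order) {
            Optional<String> type = guesser.guess(header, content);
            if (type.isPresent()) {
                return type.get();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        String header = "text/html; charset=UTF-8";
        System.out.println(resolve(Arrays.asList(SERVER_HEADER, MAGIC_BYTES), header, pdf));
        System.out.println(resolve(Arrays.asList(MAGIC_BYTES, SERVER_HEADER), header, pdf));
    }
}
```

With the server header first, the mislabelled PDF above resolves to text/html;
with content detection first, it resolves to application/pdf. That is the
case where a LOG entry stating which strategy produced the type would help.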
> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
> Key: NUTCH-1045
> URL: https://issues.apache.org/jira/browse/NUTCH-1045
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Julien Nioche
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml even though it is identical to
> the one found in tika-core.jar.
> Having a mechanism for specifying a custom tika-mimetypes.xml is good, but
> if the user hasn't specified one, or if it can't be loaded, then we should
> rely on Tika's default. This way we won't need to provide
> conf/tika-mimetypes.xml anymore, nor keep it in sync with the default Tika
> one whenever we upgrade Tika.