[ 
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067773#comment-13067773
 ] 

Julien Nioche commented on NUTCH-1045:
--------------------------------------

You should see a message in the logs at the beginning of a task saying 
something like 

{quote}
 LOG.error("Can't load mime.types.file : tika-mimetypes.xml using Tika's default"); 
{quote}

and the mime-type counts should still come out right (although I am not sure 
that we are currently reporting these in 1.4). 
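
For illustration, the fallback that leads to this message could be sketched 
roughly as below. This is a minimal sketch and not the code from the patch: the 
class name, the choose() method, the TIKA_DEFAULT marker and the loader 
function are all hypothetical, and java.util.logging is used only to keep the 
example self-contained.

```java
import java.util.function.Function;
import java.util.logging.Logger;

/** Sketch only: class and method names are hypothetical, not Nutch's MimeUtil. */
final class MimeConfigFallback {
    private static final Logger LOG =
            Logger.getLogger(MimeConfigFallback.class.getName());

    /** Marker meaning "use the definitions bundled with Tika". */
    static final String TIKA_DEFAULT = "<tika-default>";

    /**
     * Returns the name of the MIME definitions to use.
     * 'loader' stands in for whatever loads the file; it returns null on failure.
     */
    static String choose(String configuredFile, Function<String, Object> loader) {
        if (configuredFile == null || configuredFile.isEmpty()) {
            // Assumption: nothing was configured, so there is no error to report.
            return TIKA_DEFAULT;
        }
        if (loader.apply(configuredFile) == null) {
            // The custom file could not be loaded: report it and fall back.
            LOG.severe("Can't load mime.types.file : " + configuredFile
                    + " using Tika's default");
            return TIKA_DEFAULT;
        }
        return configuredFile;
    }

    public static void main(String[] args) {
        // A loader that never finds anything forces the fallback path.
        System.out.println(choose("tika-mimetypes.xml", name -> null));
    }
}
```

The point of the sketch is only that both "not configured" and "configured but 
not loadable" end up on the default, and that only the second case is logged.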

The problem is that in most cases the mime-type will be guessed from the info 
returned by the server, not from Tika's detection. The best way of making sure 
that Tika's detection works with the default settings would be to add a LOG 
entry on line 175 in MimeUtil reporting the mime-type found.
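
Something along these lines would make the two cases distinguishable in the 
logs. Again this is only a sketch under assumptions: the class and method names 
are hypothetical, the "server value first" order is assumed rather than taken 
from MimeUtil, and java.util.logging stands in for whatever logger the class 
actually uses.

```java
import java.util.logging.Logger;

/** Sketch only: hypothetical names, not the actual MimeUtil code. */
final class MimeSourceLogging {
    private static final Logger LOG =
            Logger.getLogger(MimeSourceLogging.class.getName());

    /**
     * Prefers the server-supplied type when present, otherwise the detected
     * one, and logs which source was used so the two cases can be told apart.
     */
    static String resolve(String url, String serverType, String detectedType) {
        boolean fromServer = serverType != null && !serverType.isEmpty();
        String chosen = fromServer ? serverType : detectedType;
        LOG.info("MIME type for " + url + " = " + chosen
                + " (source: " + (fromServer ? "server header" : "detection") + ")");
        return chosen;
    }
}
```

Counting the "detection" entries in a task log would then show how often the 
detection path is actually exercised.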

BTW this class is in serious need of refactoring, as the underlying Tika API 
has changed a lot. The logic around which strategies to use (trust the metadata 
returned by the server? trust Tika's detection? etc.) should be reimplemented 
using the Detector implementations. I will open a new JIRA for this.
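
To give an idea of the shape this could take, here is a small sketch of 
ordered strategies where the first one with an answer wins. It is a stand-in 
for the idea only and does not use Tika's Detector API: the Strategy interface, 
the SERVER and MAGIC strategies and the "%PDF" check are all made up for the 
example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Sketch only: a stand-in for the idea, not Tika's Detector API. */
final class TypeStrategies {

    /** Hypothetical strategy: returns a type, or empty when it has no opinion. */
    interface Strategy {
        Optional<String> guess(String serverType, byte[] head);
    }

    /** Trusts the Content-Type sent by the server, if there is one. */
    static final Strategy SERVER = (serverType, head) ->
            Optional.ofNullable(serverType).filter(t -> !t.isEmpty());

    /** Toy content check: only recognises the "%PDF" magic bytes. */
    static final Strategy MAGIC = (serverType, head) ->
            head != null && head.length >= 4
                    && head[0] == '%' && head[1] == 'P'
                    && head[2] == 'D' && head[3] == 'F'
                    ? Optional.of("application/pdf")
                    : Optional.<String>empty();

    /** Runs the strategies in order; the first one with an answer wins. */
    static String resolve(List<Strategy> order, String serverType, byte[] head) {
        for (Strategy s : order) {
            Optional<String> type = s.guess(serverType, head);
            if (type.isPresent()) {
                return type.get();
            }
        }
        return "application/octet-stream"; // nobody had an opinion
    }

    public static void main(String[] args) {
        byte[] head = {'%', 'P', 'D', 'F'};
        // Swapping the order of the list changes which source is trusted first.
        System.out.println(resolve(Arrays.asList(MAGIC, SERVER), "text/html", head));
    }
}
```

With this shape, "trust the server" versus "trust the detection" becomes a 
matter of ordering the list rather than of branching inside one method.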



> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is 
> identical to the one found in tika-core.jar.
> Having a mechanism for specifying a custom tika-mimetypes.xml is good, but 
> if the user hasn't specified one, or if it can't be loaded, then we should 
> rely on Tika's default. This way we will no longer need to provide 
> conf/tika-mimetypes.xml and keep it in sync with the default Tika one 
> whenever we upgrade Tika.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
