[ 
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067794#comment-13067794
 ] 

Julien Nioche commented on NUTCH-1045:
--------------------------------------

{quote}
Maybe because the empty file is still included in the job file?
{quote}

Do you mean that the job file contains an empty tika-mimetypes.xml? Would you 
mind removing it, running the parsing again, and adding a debug statement at 
line 175 to check that the Tika detection is actually performed?

{quote}
I'm a big proponent of detection and of never trusting the meta tags or 
headers returned.
{quote}

Having the option to choose which strategy to adopt would be better, a bit like 
what we need to do for the language ID. I have recently seen cases with RSS 
feeds where the server simply says the content is text/xml (which in a way is 
true), whereas Tika would have detected application/rss+xml. The new Detection 
API in Tika would allow us to do that rather neatly.
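
To make the idea of a selectable strategy concrete, here is a minimal sketch 
in plain Java. Everything in it is an assumption for illustration: the 
MimeStrategy names, the resolve() helper and the toy detector are hypothetical 
and are neither existing Nutch settings nor Tika's API. The detector argument 
merely stands in for whatever content-based detection Tika would perform.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical strategy names; not an existing Nutch configuration option. */
enum MimeStrategy { HEADER_ONLY, DETECT_ONLY, DETECT_THEN_HEADER }

final class MimeStrategyDemo {

    /**
     * Picks a content type from the server-supplied header and a
     * content-based detector. The detector is a stand-in for Tika and
     * returns null when it cannot decide.
     */
    static String resolve(MimeStrategy strategy, String headerType,
                          byte[] content, Function<byte[], String> detector) {
        switch (strategy) {
            case HEADER_ONLY:
                // Trust the server and never look at the content.
                return headerType;
            case DETECT_ONLY:
                // Ignore the header entirely; may return null.
                return detector.apply(content);
            case DETECT_THEN_HEADER:
            default:
                // Prefer detection, fall back to the header if it fails.
                String detected = detector.apply(content);
                return detected != null ? detected : headerType;
        }
    }

    /** Toy detector: looks for an "<rss" marker near the start. Not Tika's logic. */
    static String toyDetector(byte[] content) {
        int length = Math.min(content.length, 512);
        String head = new String(content, 0, length, StandardCharsets.UTF_8);
        return head.contains("<rss") ? "application/rss+xml" : null;
    }

    public static void main(String[] args) {
        byte[] feed = "<?xml version=\"1.0\"?><rss version=\"2.0\"></rss>"
                .getBytes(StandardCharsets.UTF_8);
        // The server claims text/xml; print what each strategy would choose.
        for (MimeStrategy s : MimeStrategy.values()) {
            String type = resolve(s, "text/xml", feed, MimeStrategyDemo::toyDetector);
            System.out.println(s + " -> " + type);
        }
    }
}
```

With the RSS case above, HEADER_ONLY keeps text/xml while the two 
detection-based strategies yield application/rss+xml. In a real patch the 
strategy would presumably be read from the configuration and the toy detector 
replaced by a call into Tika.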


> MimeUtil to rely on default config provided by Tika
> ---------------------------------------------------
>
>                 Key: NUTCH-1045
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1045
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml even though it is identical to 
> the one found in tika-core.jar.
> Having a mechanism for specifying a custom tika-mimetypes.xml is good, but if 
> the user hasn't specified one, or if it can't be loaded, then we should rely 
> on Tika's default. This way we won't need to provide 
> conf/tika-mimetypes.xml anymore or keep it in sync with the default Tika one 
> whenever we upgrade Tika.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
