[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086301#comment-13086301 ]

Julien Nioche commented on NUTCH-1075:
--------------------------------------

If you can't see it in the metadata displayed by ParserChecker, you definitely 
won't get it in IndexerChecker. Could there be something specific in your 
config? You've added the plugin to the list, haven't you?
Have you tried debugging in Eclipse to see whether you at least reach the 
parser class?

Thanks!


> Delegate language identification to Tika
> ----------------------------------------
>
>                 Key: NUTCH-1075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.4
>
>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch
>
>
> In 2.0 the language identification is delegated to Tika and is done as part 
> of the parsing step (rather than during indexing, as is currently done).
> The attached patch is a backport from trunk which implements this and adds a 
> new parameter to determine the strategy to use:
> {code:xml} 
> <property>
>   <name>lang.extraction.policy</name>
>   <value>detect,identify</value>
>   <description>This determines when the plugin uses detection and
>   statistical identification mechanisms. The order in which the
>   detect and identify are written will determine the extraction
>   policy. The default case (detect,identify) means the plugin will
>   first try to extract language info from page headers and metadata;
>   if this is not successful, it will try using Tika language
>   identification. Possible values are:
>     detect
>     identify
>     detect,identify
>     identify,detect
>   </description>
> </property>
> {code} 
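
As an illustration of the ordering semantics described in the property above, here is a minimal sketch of how a comma-separated policy could be applied. It is not the code from the patch: the class name LangPolicySketch, the method extractLanguage and the two Supplier arguments are hypothetical stand-ins for the header/metadata lookup ("detect") and the statistical identification ("identify"), and no Nutch or Tika API is used.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch; names are illustrative and do not come from the patch.
class LangPolicySketch {

    /**
     * Runs the strategies in the order listed in the policy string and
     * returns the first language found, or empty if none succeeds.
     */
    static Optional<String> extractLanguage(String policy,
                                            Supplier<Optional<String>> detect,
                                            Supplier<Optional<String>> identify) {
        for (String step : policy.split(",")) {
            Optional<String> lang;
            switch (step.trim()) {
                case "detect":   // assumed: look at page headers and metadata
                    lang = detect.get();
                    break;
                case "identify": // assumed: statistical identification
                    lang = identify.get();
                    break;
                default:
                    throw new IllegalArgumentException("Unknown strategy: " + step);
            }
            if (lang.isPresent()) {
                return lang; // first successful strategy wins
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in strategies: no language in the headers, "fr" from statistics.
        Supplier<Optional<String>> noHeaderInfo = Optional::empty;
        Supplier<Optional<String>> statistical = () -> Optional.of("fr");
        System.out.println(
            extractLanguage("detect,identify", noHeaderInfo, statistical).orElse("unknown"));
    }
}
```

With "detect,identify" the statistical step only runs when the headers and metadata yield nothing; with "identify,detect" the order is reversed, and a single value disables the other strategy entirely.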

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira