Delegate language identification to Tika
----------------------------------------

Key: NUTCH-1075
URL: https://issues.apache.org/jira/browse/NUTCH-1075
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.4

In 2.0 the language identification is delegated to Tika and is done as part of the parsing step (and not during indexing, as is currently done). The attached patch is a backport from trunk which implements this and adds a new parameter to determine the strategy to use:

{code:xml}
<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
  <description>This determines when the plugin uses detection and
  statistical identification mechanisms. The order in which detect
  and identify are written determines the extraction policy.
  The default case (detect,identify) means the plugin will first try to
  extract language info from page headers and metadata; if this is not
  successful, it will try using Tika language identification.
  Possible values are:
    detect
    identify
    detect,identify
    identify,detect
  </description>
</property>
{code}
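
To illustrate how the ordering in lang.extraction.policy could be interpreted, here is a minimal sketch. It is not the code from the patch: the class name LangPolicySketch, the extract method, and the two Supplier arguments are hypothetical stand-ins for header/metadata detection and Tika-based statistical identification, and the sketch assumes that an unknown policy step should be rejected.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration only; not the Nutch plugin implementation.
public class LangPolicySketch {

    /**
     * Runs the mechanisms named in the policy string in the order given
     * and returns the first language code that one of them produces.
     *
     * @param policy   comma-separated steps, e.g. "detect,identify"
     * @param detect   stand-in for extraction from page headers and metadata
     * @param identify stand-in for statistical identification (e.g. via Tika)
     */
    public static Optional<String> extract(String policy,
                                           Supplier<Optional<String>> detect,
                                           Supplier<Optional<String>> identify) {
        for (String step : policy.split(",")) {
            Optional<String> lang;
            switch (step.trim()) {
                case "detect":
                    lang = detect.get();
                    break;
                case "identify":
                    lang = identify.get();
                    break;
                default:
                    // Assumption: anything other than the two documented steps is an error.
                    throw new IllegalArgumentException("Unknown policy step: " + step);
            }
            if (lang.isPresent()) {
                return lang; // the first successful mechanism wins
            }
        }
        return Optional.empty(); // no mechanism in the policy found a language
    }

    public static void main(String[] args) {
        // Headers carry no language here, so the policy falls through to identification.
        Optional<String> lang =
                extract("detect,identify", Optional::empty, () -> Optional.of("fr"));
        System.out.println(lang.orElse("unknown")); // prints "fr"
    }
}
```

With a policy of "detect" alone, the same call would return an empty result, because the identification step is never consulted.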