Delegate language identification to Tika
----------------------------------------
Key: NUTCH-1075
URL: https://issues.apache.org/jira/browse/NUTCH-1075
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.4
In 2.0, language identification is delegated to Tika and is done as part of
the parsing step (and not during indexing, as is currently done).
The attached patch is a backport from trunk which implements this and adds a
new parameter that determines which strategy to use:
{code:xml}
<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
  <description>This determines when the plugin uses detection and
  statistical identification mechanisms. The order in which detect
  and identify are written determines the extraction policy. The
  default (detect,identify) means the plugin will first try to
  extract language info from page headers and metadata; if this is
  not successful, it will fall back to Tika language identification.
  Possible values are:
    detect
    identify
    detect,identify
    identify,detect
  </description>
</property>
{code}
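To run statistical identification first and fall back to page headers and metadata only when it fails, the order of the values can be reversed. A minimal sketch, assuming the usual Nutch convention of overriding defaults in conf/nutch-site.xml (the file location and the choice of value are illustrative assumptions, not part of the patch description):
{code:xml}
<!-- conf/nutch-site.xml (assumed override location) -->
<property>
  <name>lang.extraction.policy</name>
  <!-- Try Tika's statistical identification first, then headers/metadata -->
  <value>identify,detect</value>
</property>
{code}
Using a single value (detect or identify) would presumably restrict the plugin to that one mechanism, with no fallback.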