[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086441#comment-13086441 ]
Markus Jelsma commented on NUTCH-1075:
--------------------------------------

The clean 1.4 checkout I tested as well (see above) failed in the same fashion. I double-checked runtime/local/libs and it does indeed use Tika 0.9 core. I don't think my computer likes me anymore. If you manage to do so with a clean checkout + patch, there seems to be something really wrong on my system.

> Delegate language identification to Tika
> ----------------------------------------
>
>                 Key: NUTCH-1075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.4
>
>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch
>
>
> In 2.0 the language identification is delegated to Tika and is done as part
> of the parsing step (and not during the indexing as done currently).
> The patch attached is a backport from trunk which implements this and adds a
> new parameter to determine the strategy to use:
> {code:xml}
> <property>
>   <name>lang.extraction.policy</name>
>   <value>detect,identify</value>
>   <description>This determines when the plugin uses detection and
>   statistical identification mechanisms. The order in which the
>   detect and identify are written will determine the extraction
>   policy. Default case (detect,identify) means the plugin will
>   first try to extract language info from page headers and metadata,
>   if this is not successful it will try using tika language
>   identification. Possible values are:
>     detect
>     identify
>     detect,identify
>     identify,detect
>   </description>
> </property>
> {code}
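
The ordering semantics described for lang.extraction.policy can be illustrated with a minimal sketch. This is not Nutch or Tika code: the names `parse_policy` and `extract_language`, and the `detect` / `identify` callables, are hypothetical stand-ins for "read the language from page headers and metadata" and "run statistical identification". It only assumes that the strategies are tried in the order written and that the first non-empty result wins.

```python
from typing import Callable, List, Optional

# The two step names come from the property description; everything else
# in this sketch is a hypothetical illustration, not the plugin's code.
VALID_STEPS = ("detect", "identify")


def parse_policy(value: str) -> List[str]:
    """Split a policy string such as 'detect,identify' into ordered steps."""
    steps = [step.strip() for step in value.split(",") if step.strip()]
    for step in steps:
        if step not in VALID_STEPS:
            raise ValueError(f"unknown extraction step: {step!r}")
    return steps


def extract_language(
    policy: str,
    detect: Callable[[], Optional[str]],
    identify: Callable[[], Optional[str]],
) -> Optional[str]:
    """Try each strategy in policy order; return the first non-empty result."""
    strategies = {"detect": detect, "identify": identify}
    for step in parse_policy(policy):
        language = strategies[step]()
        if language:
            return language
    return None  # no strategy produced a language
```

With the default value `detect,identify`, a page whose headers carry no language information would fall through to the `identify` callable; with `detect` alone, the same page would yield no language at all.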