[
https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086318#comment-13086318
]
Markus Jelsma commented on NUTCH-1075:
--------------------------------------
Yes, it's in the list. When I reverse the patches I get the old
language-identifier back, and then everything works. There are no unusual
changes in my checkout, except that I'm using a very recent
tika-app-1.0-SNAPSHOT.jar instead of Tika core 0.9.
I just tried a clean 1.4-dev checkout with your latest patch and
language-identifier|protocol-http|parse-tika|index-(basic|more|anchor) as
plugins, to no avail. The language plugin is registered, and both the parser
and indexing hooks are executed properly.
I've added some debugging and commenting to see what it's doing (without
Eclipse), and it's clear that lang is always null in
HTMLLanguageParser.filter().
> Delegate language identification to Tika
> ----------------------------------------
>
> Key: NUTCH-1075
> URL: https://issues.apache.org/jira/browse/NUTCH-1075
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.4
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.4
>
> Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch
>
>
> In 2.0 the language identification is delegated to Tika and is done as part
> of the parsing step (and not during the indexing as done currently).
> The attached patch is a backport from trunk which implements this and adds a
> new parameter to determine the strategy to use:
> {code:xml}
> <property>
>   <name>lang.extraction.policy</name>
>   <value>detect,identify</value>
>   <description>This determines when the plugin uses detection and
>   statistical identification mechanisms. The order in which
>   detect and identify are written determines the extraction
>   policy. The default (detect,identify) means the plugin will
>   first try to extract language info from page headers and metadata;
>   if this is not successful, it will try Tika language
>   identification. Possible values are:
>     detect
>     identify
>     detect,identify
>     identify,detect
>   </description>
> </property>
> {code}
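
The ordering described for lang.extraction.policy can be illustrated with a
small sketch. This is not the plugin's actual code: the class name, the
resolve method and the two suppliers are hypothetical stand-ins for "read the
language from headers/metadata" (detect) and "run statistical identification"
(identify). It only shows the first-non-empty-result-wins behaviour the
description implies.

```java
import java.util.function.Supplier;

// Hypothetical illustration of the policy ordering; not taken from the patch.
class LangPolicySketch {

    // Runs the strategies named in the policy string, in order, and returns
    // the first non-empty result, or null if no strategy produced a language.
    static String resolve(String policy,
                          Supplier<String> detect,
                          Supplier<String> identify) {
        for (String step : policy.split(",")) {
            String lang;
            switch (step.trim()) {
                case "detect":
                    lang = detect.get();
                    break;
                case "identify":
                    lang = identify.get();
                    break;
                default:
                    throw new IllegalArgumentException("Unknown policy step: " + step);
            }
            if (lang != null && !lang.isEmpty()) {
                return lang;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Supplier<String> noHeaderLang = () -> null;   // page declares no language
        Supplier<String> statistical = () -> "nl";    // assumed identification result

        // Falls through from detect to identify.
        System.out.println(resolve("detect,identify", noHeaderLang, statistical)); // nl
        // With detect only, nothing is found and the result stays null.
        System.out.println(resolve("detect", noHeaderLang, statistical));          // null
    }
}
```

With the default detect,identify, a null result would mean that both steps
came back empty, assuming the policy behaves as the description says.
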