[jira] [Commented] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin

Yossi Tamari (JIRA) Tue, 24 Oct 2017 07:31:43 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217004#comment-16217004
 ]


Yossi Tamari commented on NUTCH-2449:
-------------------------------------

Since Tika has superseded LanguageIdentifier with 
org.apache.tika.language.detect.LanguageDetector, the obvious fix is to make the 
same change in the plugin. However, the design of LanguageDetector is terrible: 
it makes the implementation non-reentrant, meaning the full language model would 
have to be reloaded on each call to the detector.
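
To illustrate the reentrancy concern, here is a minimal sketch of a detector 
with an accumulate-then-detect shape (text is added to an instance, then 
detection runs on whatever has accumulated). Everything in it is hypothetical: 
ToyDetector, loadModel() and the stop-word "model" are stand-ins invented for 
illustration and are not Tika code. It only shows how per-instance state leaks 
between documents when one instance is reused without care.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy detector, NOT Tika code. It mimics an
// accumulate-then-detect API shape to show why reuse is fragile.
final class ToyDetector {
    // Stand-in for an expensive language model: a few stop words per language.
    private final Map<String, Set<String>> model;
    // Mutable per-document state; this is what makes a shared instance unsafe.
    private final StringBuilder buffer = new StringBuilder();

    ToyDetector(Map<String, Set<String>> model) {
        this.model = model;
    }

    // Stand-in for the costly model load (assumed expensive in a real detector).
    static Map<String, Set<String>> loadModel() {
        Map<String, Set<String>> m = new HashMap<>();
        m.put("en", new HashSet<>(Arrays.asList("the", "and", "is")));
        m.put("it", new HashSet<>(Arrays.asList("il", "e", "che")));
        return m;
    }

    // Appends text to the instance state; nothing is detected yet.
    void addText(CharSequence text) {
        buffer.append(text).append(' ');
    }

    // Clears the accumulated text so the instance can take a new document.
    void reset() {
        buffer.setLength(0);
    }

    // Returns the language whose stop words occur most often in the
    // accumulated text, or "unknown" if none occur.
    String detect() {
        String[] words = buffer.toString().toLowerCase(Locale.ROOT).split("\\s+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : model.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}

final class StatefulDetectorSketch {
    public static void main(String[] args) {
        ToyDetector detector = new ToyDetector(ToyDetector.loadModel());

        detector.addText("il gatto e il cane che dorme");
        System.out.println("doc 1: " + detector.detect());

        // Reusing the instance without reset(): doc 1 text is still in the
        // buffer and outvotes the English document.
        detector.addText("the dog is here");
        System.out.println("doc 2 without reset: " + detector.detect());

        detector.reset();
        detector.addText("the dog is here");
        System.out.println("doc 2 after reset: " + detector.detect());
    }
}
```

In this toy, the second document is misreported until the state is cleared. 
The same per-instance buffer is why one instance cannot safely serve 
overlapping calls, which is the constraint behind the reload concern above.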

> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>
> The language-identifier plugin uses 
> org.apache.tika.language.LanguageIdentifier for extracting the language from 
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see 
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>  and it doesn’t even fail gracefully with them: in my experience, Chinese was 
> recognized as Italian.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
