[
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney resolved NUTCH-2449.
-----------------------------------------
Resolution: Fixed
> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
> Key: NUTCH-2449
> URL: https://issues.apache.org/jira/browse/NUTCH-2449
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.13
> Reporter: Yossi Tamari
> Priority: Major
> Fix For: 1.19
>
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it does not even fail gracefully on them: in my experience, Chinese was
> recognized as Italian.
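
To illustrate what "failing gracefully" could look like for the Chinese-as-Italian case described above, here is a minimal sketch that uses only the JDK. It is not the fix that was applied for this issue and it does not use any Tika API. The class name ScriptGuard, the method names, the "unknown" return value, and the more-than-half threshold are all assumptions made for illustration. The idea is to check the dominant Unicode script of the text and reject a detector's guess when the text is mostly CJK but the guessed language is not.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Nutch or Tika.
final class ScriptGuard {

    // True if more than half of the letters belong to a CJK script.
    static boolean isMostlyCjk(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN
                    || script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HANGUL) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters;
    }

    // Returns the detector's guess unless it contradicts the script of the text.
    // "unknown" is an assumed sentinel value, not a Nutch convention.
    static String guard(String text, String detectorGuess) {
        boolean guessIsCjk = detectorGuess.equals("zh")
                || detectorGuess.equals("ja")
                || detectorGuess.equals("ko");
        if (isMostlyCjk(text) && !guessIsCjk) {
            return "unknown";
        }
        return detectorGuess;
    }

    public static void main(String[] args) {
        // A Chinese sentence that a detector wrongly labelled as Italian.
        System.out.println(guard("这是一个中文句子", "it"));
        // An Italian sentence with a plausible guess is passed through.
        System.out.println(guard("Questa è una frase italiana", "it"));
    }
}
```

A script check of this kind cannot tell Chinese from Japanese or Korean on its own. It only prevents an obviously wrong label from being stored.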