Yossi Tamari created NUTCH-2449:
-----------------------------------
Summary: Usage of Tika LanguageIdentifier in language-identifier plugin
Key: NUTCH-2449
URL: https://issues.apache.org/jira/browse/NUTCH-2449
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.13
Reporter: Yossi Tamari
The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier
for extracting the language from the document text. There are two issues with
that:
# LanguageIdentifier is deprecated in Tika.
# It does not support CJK languages (and, I suspect, many other languages; see
https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
and it does not even fail gracefully on them: in my experience, Chinese was
recognized as Italian.
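
To illustrate what "failing gracefully" could look like for the second point, here is a minimal sketch that checks the dominant Unicode script of the text before trusting a Latin-script guess such as "it". It uses only the JDK. The class name CjkGuard, its method names, and the 0.5 threshold are assumptions made for this example; none of this is code from Nutch or Tika, and it is not a proposed patch.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper (not part of Nutch or Tika): estimates how much of a
// text is written in CJK scripts, so a caller can reject an implausible
// Latin-script language guess instead of storing it.
final class CjkGuard {

    // Scripts treated as "CJK" for this sketch.
    private static final Set<UnicodeScript> CJK_SCRIPTS = EnumSet.of(
            UnicodeScript.HAN,
            UnicodeScript.HIRAGANA,
            UnicodeScript.KATAKANA,
            UnicodeScript.HANGUL);

    private CjkGuard() {
    }

    // Fraction of letter code points that belong to a CJK script.
    // Returns 0.0 when the text contains no letters at all.
    static double cjkRatio(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            letters++;
            if (CJK_SCRIPTS.contains(UnicodeScript.of(cp))) {
                cjk++;
            }
        }
        return letters == 0 ? 0.0 : (double) cjk / letters;
    }

    // Assumed threshold: more than half of the letters are CJK.
    static boolean isMostlyCjk(String text) {
        return cjkRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        String chinese = "\u8fd9\u662f\u4e2d\u6587"; // four Han characters
        String italian = "Questa e una frase italiana";
        System.out.println(isMostlyCjk(chinese));
        System.out.println(isMostlyCjk(italian));
    }
}
```

A caller could skip setting the language field, or mark it as unknown, when isMostlyCjk(text) is true but the identifier returned a language written in Latin script. This only guards against the misclassification described above; it does not identify which CJK language the text is in.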
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)