Yossi Tamari created NUTCH-2449:
-----------------------------------

             Summary: Usage of Tika LanguageIdentifier in language-identifier plugin
                 Key: NUTCH-2449
                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
             Project: Nutch
          Issue Type: Improvement
          Components: plugin
    Affects Versions: 1.13
            Reporter: Yossi Tamari


The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier 
to detect the language of the document text. There are two issues with this:
# LanguageIdentifier is deprecated in Tika.
# It does not support CJK languages (and, I suspect, many other languages; see 
https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), 
and it does not fail gracefully on them: in my experience, Chinese text was 
identified as Italian.
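
To illustrate what failing gracefully could look like, here is a minimal sketch 
that uses only the JDK. It is not the plugin's code and not a Tika API: the 
class CjkScriptGuard, its methods, and the "more than half of the letters" 
threshold are all assumptions made for illustration. The idea is to check the 
Unicode script of the text first and return no language at all, rather than a 
wrong one, when the text is mostly CJK and the detector has no CJK profiles.

```java
import java.lang.Character.UnicodeScript;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch or Tika.
final class CjkScriptGuard {

    private CjkScriptGuard() {
    }

    /**
     * Returns true if more than half of the letters in the text belong to
     * the Han, Hiragana, Katakana, or Hangul scripts. The 50% threshold is
     * an arbitrary assumption.
     */
    static boolean isMostlyCjk(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation, whitespace
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN
                    || script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HANGUL) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters;
    }

    /**
     * Runs the given detector (assumed to have no CJK support) only when
     * the text is not mostly CJK; otherwise returns null, meaning
     * "language unknown", instead of a misleading guess.
     */
    static String guardedLanguage(String text, Function<String, String> detector) {
        return isMostlyCjk(text) ? null : detector.apply(text);
    }
}
```

With a guard like this, a detector that would label Chinese text as "it" is 
never consulted for that text, and the document is left without a language 
value. Replacing the deprecated detector with one that actually supports CJK 
would still be the better fix; the guard only avoids the wrong answer.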




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
