[ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461770#comment-17461770 ]
Hudson commented on NUTCH-2449:
-------------------------------
ABORTED: Integrated in Jenkins build Nutch » Nutch-trunk #63 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/63/])
NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716) (github: [https://github.com/apache/nutch/commit/a9b50a7c7e0ab83865883bf87f2c98f1ce354388])
* (add) src/plugin/language-identifier/build-ivy.xml
* (edit) src/plugin/language-identifier/build.xml
> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
> Key: NUTCH-2449
> URL: https://issues.apache.org/jira/browse/NUTCH-2449
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.13
> Reporter: Yossi Tamari
> Priority: Major
> Fix For: 1.19
>
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it does not even fail gracefully with them: in my experience, Chinese was
> recognized as Italian.
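
To illustrate what "failing gracefully" could look like for the second point, here is a minimal sketch of a script-based guard. It is not the change made in this issue and not part of Nutch or Tika: the class `ScriptGuard`, its method names, and the "unknown" return value are all hypothetical, and the 50% threshold is an arbitrary assumption. It uses only `java.lang.Character.UnicodeScript` from the JDK to check whether a text is predominantly written in a CJK script before trusting a language code from a detector that does not cover those languages.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical helper for illustration only; not part of Nutch or Tika. */
final class ScriptGuard {
    private ScriptGuard() {}

    /** True if the code point belongs to a script used to write Chinese, Japanese, or Korean. */
    static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HANGUL;
    }

    /** True if more than half of the letters in the text are CJK (threshold is an assumption). */
    static boolean isMostlyCjk(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation, and whitespace
            }
            letters++;
            if (isCjk(cp)) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters;
    }

    /**
     * Returns "unknown" instead of the detector's guess when the text is mostly CJK,
     * on the assumption that the detector has no models for those languages.
     */
    static String guardedLanguage(String text, String detected) {
        return isMostlyCjk(text) ? "unknown" : detected;
    }
}
```

With a guard like this, a call such as `ScriptGuard.guardedLanguage(text, detectedCode)` would report "unknown" for mostly-Chinese text rather than passing through a misleading code such as "it". It only works around the symptom; it does not identify the actual language.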
--
This message was sent by Atlassian Jira
(v8.20.1#820001)