[ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460219#comment-17460219 ]
ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------
sebastian-nagel commented on pull request #233:
URL: https://github.com/apache/nutch/pull/233#issuecomment-995167042
Hi @lewismc, via NUTCH-2891 / b0cbea5 we already switched to the Optimaize
language detector wrapped by Tika. I have it on my list to verify that this PR
(and NUTCH-1397) is resolved en passant, but I haven't had the time to look it
over. If you want, feel free to take over. Thanks!
> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
> Key: NUTCH-2449
> URL: https://issues.apache.org/jira/browse/NUTCH-2449
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.13
> Reporter: Yossi Tamari
> Priority: Major
> Fix For: 1.19
>
>
> The language-identifier plugin uses
> org.apache.tika.language.LanguageIdentifier for extracting the language from
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages -
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn’t even fail gracefully with them - in my experience Chinese was
> recognized as Italian.