[ 
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217264#comment-16217264
 ] 

ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------

lewismc commented on issue #233: NUTCH-2449: Replace Tika LanguageIdentifier in 
language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-339061036
 
 
   Hi @YossiTamari, I think this feature would be better proposed as a 
brand new plugin. The existing language-identifier plugin is stable and there 
is no real reason to get rid of it entirely. There is, however, always value in 
adding a new plugin; folks can then choose as they wish.
   Can you repurpose this pull request to add a new plugin?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>
> The language-identifier plugin uses 
> org.apache.tika.language.LanguageIdentifier for extracting the language from 
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages - 
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>  and it doesn’t even fail gracefully with them - in my experience Chinese was 
> recognized as Italian.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
