[ 
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217284#comment-16217284
 ] 

ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------

pipldev commented on issue #233: NUTCH-2449: Replace Tika LanguageIdentifier in 
language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-339063168
 
 
   Sebastian made the same suggestion, but I really don't like it personally 
for two reasons:
   
   1. 90% of the plugin's code would be copied, and somebody would have to 
apply every fix to both plugins.
   2. I think it would be really confusing to users to find two plugins that do 
exactly the same thing, with the only difference being an implementation detail.
   
   Keep in mind that while the existing plugin is stable, it relies on 
deprecated code, and will presumably break once Tika removes the 
deprecated implementation.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>
> The language-identifier plugin uses 
> org.apache.tika.language.LanguageIdentifier for extracting the language from 
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages - 
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>  and it doesn’t even fail gracefully with them - in my experience Chinese was 
> recognized as Italian.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
