[ 
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217309#comment-16217309
 ] 

ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------

lewismc commented on issue #233: NUTCH-2449: Replace Tika LanguageIdentifier in 
language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-339066193
 
 
   @pipldev thanks
   
   > 90% of the code of the plugin would be copied, and somebody would have to 
manage applying every fix to both plugins.
   
   I agree. There have been VERY few changes to this plugin in almost a decade. 
I don't anticipate that there will be too many changes moving forward.
   
   > I think it would be really confusing to users to find two plugins that do 
exactly the same thing, with the only difference being an implementation detail.
   
   Not if it is documented in a README.md contained within the plugin 
directory. Usually people interested in linguistics are also very interested in 
the applications or techniques used in tasks such as language identification, 
statistical machine translation, etc. I think any confusion could be easily 
avoided if we provide a simple README.
   
   > Keep in mind that while the existing plugin is stable, it relies on 
deprecated code, and presumably would break eventually when Tika decides to 
remove the deprecated implementation.
   
   I agree, but as I said, it is stable. We have tight integration between the 
Nutch and Tika dev communities. Once the code is removed over in Tika, it is 
trivial to remove the code over here.
   
   At the end of the day, it is up to the contributor whether this code is 
updated or not. I just think preservation is an important part of what we do 
here. Retaining the legacy plugin does no harm whatsoever.
   Thanks

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>
> The language-identifier plugin uses 
> org.apache.tika.language.LanguageIdentifier for extracting the language from 
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages - 
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>  and it doesn’t even fail gracefully with them - in my experience Chinese was 
> recognized as Italian.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
