[ 
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460075#comment-17460075
 ] 

ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------

lewismc commented on pull request #233:
URL: https://github.com/apache/nutch/pull/233#issuecomment-994961039


   @YossiTamari @sebastian-nagel this has been sitting for way too long. By the 
looks of the above correspondence, the decision was made to replace the logic 
in language-identifier with the Optimaize logic.
   [Tika LanguageDetector](https://tika.apache.org/2.1.0/api/index.html?org/apache/tika/language/detect/LanguageDetector.html)
now offers several implementations, which I think we could easily drive through 
configuration in nutch-default.xml.
   Where do we want to go with this one? I can help with the effort.  
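
   To make the configuration-driven idea concrete, here is a minimal, 
stdlib-only sketch of selecting a detector implementation from a property 
value. Everything in it is an assumption for illustration: the `Detector` 
interface is a hypothetical stand-in (not Tika's `LanguageDetector`), the 
property name `lang.detector.impl` does not exist in nutch-default.xml today, 
and the keys `optimaize` and `opennlp` are only example values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

public class DetectorSelection {

  /** Hypothetical stand-in for a language detector; not a Tika or Nutch type. */
  public interface Detector {
    String name();
  }

  /** Hypothetical property name; nothing like it is defined in Nutch yet. */
  public static final String IMPL_KEY = "lang.detector.impl";

  // Hypothetical registry: configuration value -> factory for an implementation.
  private static final Map<String, Supplier<Detector>> REGISTRY = new HashMap<>();
  static {
    REGISTRY.put("optimaize", () -> named("optimaize"));
    REGISTRY.put("opennlp", () -> named("opennlp"));
  }

  // Builds a placeholder detector that only reports its own name.
  private static Detector named(String name) {
    return () -> name;
  }

  /** Picks an implementation from configuration, defaulting to "optimaize". */
  public static Detector select(Properties conf) {
    String impl = conf.getProperty(IMPL_KEY, "optimaize");
    Supplier<Detector> factory = REGISTRY.get(impl);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown detector implementation: " + impl);
    }
    return factory.get();
  }

  public static void main(String[] args) {
    // With no property set, the default implementation is chosen.
    System.out.println(select(new Properties()).name());
  }
}
```

   In the plugin itself, the factories would presumably construct real Tika 
detector implementations instead of placeholders; failing fast on an unknown 
value keeps a typo in the configuration from silently falling back to another 
detector.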


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>             Fix For: 1.19
>
>
> The language-identifier plugin uses 
> org.apache.tika.language.LanguageIdentifier for extracting the language from 
> the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and I suspect a lot of other languages - 
> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>  and it doesn’t even fail gracefully with them: in my experience, Chinese was 
> recognized as Italian.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
