[ 
https://issues.apache.org/jira/browse/NUTCH-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-619.
-------------------------------

    Resolution: Won't Fix

Language identification is now delegated to Tika.
                
> Another Language Identifier Plugin using Unicode code point range
> -----------------------------------------------------------------
>
>                 Key: NUTCH-619
>                 URL: https://issues.apache.org/jira/browse/NUTCH-619
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Vinci
>
> After checking the language-identifier plugin, I found that its internal 
> implementation is inefficient for languages that can be clearly identified 
> by their Unicode code points (e.g. CJK languages).
> If Nutch works in Unicode, can anybody write a language identifier based on 
> Unicode code point ranges? The map is here:
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> You can also refer to NutchAnalysis.jj for some of the language code ranges.
> * Some recently added or rare characters, including some CJK 
> characters, are placed in the SIP (Supplementary Ideographic Plane).
> * Perhaps a special property should be set when characters from multiple 
> languages are detected (languages written in something other than the 
> English alphabet). My suggestion is to let the CJK locale be the default 
> case, since those languages need a bi-gram or other analyzer for better 
> indexing.
> ** CJK characters are very difficult to divide further because the 
> languages share Han characters. If you really want to identify the specific 
> member of CJK, you need to use the language-identifier plugin.
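
The code point range idea in the request can be illustrated with a minimal sketch. It is not Nutch or Tika code: the names SCRIPT_RANGES, script_of, count_scripts and needs_cjk_analyzer are hypothetical, and the range table is a small illustrative subset of the Unicode blocks, not a complete map. It labels Han characters only as "han", since, as the reporter notes, code points alone cannot tell the CJK languages apart.

```python
from collections import Counter

# Illustrative (start, end, label) ranges, inclusive. This is an assumed,
# deliberately incomplete subset of the Unicode block table.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "latin"),     # Basic Latin uppercase letters
    (0x0061, 0x007A, "latin"),     # Basic Latin lowercase letters
    (0x3040, 0x309F, "hiragana"),
    (0x30A0, 0x30FF, "katakana"),
    (0x3400, 0x4DBF, "han"),       # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF, "han"),       # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "hangul"),    # Hangul Syllables
    (0x20000, 0x2A6DF, "han"),     # Extension B, located in the SIP
]

CJK_SCRIPTS = {"han", "hiragana", "katakana", "hangul"}


def script_of(ch):
    """Return the script label for one character, or None if unlisted."""
    cp = ord(ch)
    for start, end, label in SCRIPT_RANGES:
        if start <= cp <= end:
            return label
    return None


def count_scripts(text):
    """Count how many characters of each listed script appear in text."""
    return Counter(s for s in map(script_of, text) if s is not None)


def needs_cjk_analyzer(text):
    """True if any CJK character is present, mirroring the suggestion that
    mixed-script text should default to a CJK (e.g. bi-gram) analyzer."""
    return any(script_of(ch) in CJK_SCRIPTS for ch in text)
```

A caller could run needs_cjk_analyzer on a page's text first and fall back to a statistical identifier only when the specific CJK language, or a non-CJK language, has to be determined.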

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
