[ https://issues.apache.org/jira/browse/NUTCH-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche closed NUTCH-619.
-------------------------------
Resolution: Won't Fix
Language identification is now delegated to Tika.
> Another Language Identifier Plugin using Unicode code point range
> -----------------------------------------------------------------
>
> Key: NUTCH-619
> URL: https://issues.apache.org/jira/browse/NUTCH-619
> Project: Nutch
> Issue Type: Wish
> Reporter: Vinci
>
> After checking the language-identifier plugin, I found that its internal
> implementation is inefficient for languages that can be clearly identified
> from their Unicode code points (e.g. CJK languages).
> If Nutch works in Unicode, could somebody write a language identifier based on
> Unicode code point ranges? The map is here:
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> You can also refer to NutchAnalysis.jj for some of the language code ranges.
> * Some recently added scripts and rare characters, including some CJK
> characters, are placed in the SIP (Supplementary Ideographic Plane).
> * A special property might be set when characters from multiple languages
> are detected (languages other than those using the English alphabet). My
> suggestion is to let the CJK locale be the default case, since CJK text needs
> a bi-gram or other analyzer for better indexing.
> ** CJK characters are very difficult to divide further because the languages
> share Han characters. If you really want to identify the specific member of
> CJK, you need to use the language identifier plugin.
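
For illustration only, the code point range idea in the request above could be sketched roughly as follows. This is not Nutch or Tika code: the range table, the function names (script_of, guess_language) and the returned labels are all assumptions made for this sketch. Treating any kana as Japanese and any Hangul as Korean is a simplifying heuristic, and Han-only text is deliberately reported as the undivided "cjk" case, as the request suggests.

```python
from collections import Counter

# Inclusive Unicode block ranges. The selection is illustrative, not exhaustive;
# the labels are hypothetical names chosen for this sketch.
SCRIPT_RANGES = [
    (0x3040, 0x309F, "hiragana"),
    (0x30A0, 0x30FF, "katakana"),
    (0x3400, 0x4DBF, "han"),     # CJK Unified Ideographs Extension A (BMP)
    (0x4E00, 0x9FFF, "han"),     # CJK Unified Ideographs (BMP)
    (0xAC00, 0xD7AF, "hangul"),  # Hangul Syllables (BMP)
    (0x20000, 0x2A6DF, "han"),   # CJK Unified Ideographs Extension B (SIP)
]


def script_of(ch):
    """Return the script label for one character, or None if it is not in the table."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def guess_language(text):
    """Guess 'ja', 'ko', 'cjk' (Han only, ambiguous) or 'other' from code points alone."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if counts["hiragana"] or counts["katakana"]:
        return "ja"   # assumption: kana implies Japanese
    if counts["hangul"]:
        return "ko"   # assumption: Hangul implies Korean
    if counts["han"]:
        return "cjk"  # shared Han characters: needs a statistical identifier to split further
    return "other"


if __name__ == "__main__":
    for sample in ("こんにちは世界", "안녕하세요", "漢字", "Hello 世界", "Hello"):
        print(repr(sample), guess_language(sample))
```

Mixed text such as Latin letters plus Han characters falls into the "cjk" case here, matching the suggestion that CJK be the default when several scripts are detected.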