Another Language Identifier Plugin using Unicode code point range
-----------------------------------------------------------------

                 Key: NUTCH-619
                 URL: https://issues.apache.org/jira/browse/NUTCH-619
             Project: Nutch
          Issue Type: Wish
            Reporter: Vinci


After checking the language-identifier plugin, I found that its internal 
implementation is inefficient for languages that can be clearly identified 
from their Unicode code points (e.g. the CJK languages).

If Nutch works on Unicode text internally, could anybody write a language 
identifier based on Unicode code point ranges? The map is here:
http://en.wikipedia.org/wiki/Basic_Multilingual_Plane

You can also refer to NutchAnalysis.jj for some of the language code ranges.
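
To make the request concrete, here is a minimal sketch of a range lookup. It 
is not taken from Nutch or from NutchAnalysis.jj. The class name 
CodePointRanges, the method classify and the label strings are all 
hypothetical. The ranges are standard Unicode block boundaries, and the table 
is deliberately incomplete.

```java
final class CodePointRanges {
    // Each row is {first code point, last code point} of one Unicode block.
    private static final int[][] RANGES = {
        {0x0370, 0x03FF},    // Greek and Coptic
        {0x0400, 0x04FF},    // Cyrillic
        {0x3040, 0x309F},    // Hiragana
        {0x30A0, 0x30FF},    // Katakana
        {0x4E00, 0x9FFF},    // CJK Unified Ideographs
        {0xAC00, 0xD7AF},    // Hangul Syllables
        {0x20000, 0x2A6DF},  // CJK Unified Ideographs Extension B (in the SIP)
    };

    // Hypothetical labels, parallel to RANGES. They are not Nutch identifiers.
    private static final String[] LABELS = {
        "greek", "cyrillic", "hiragana", "katakana", "han", "hangul", "han"
    };

    /** Returns the label of the first range containing codePoint, or "other". */
    static String classify(int codePoint) {
        for (int i = 0; i < RANGES.length; i++) {
            if (codePoint >= RANGES[i][0] && codePoint <= RANGES[i][1]) {
                return LABELS[i];
            }
        }
        return "other";
    }

    public static void main(String[] args) {
        String text = "Nutch \u691C\u7D22";
        // Step by code point, not by char, so that characters outside the
        // BMP (stored as surrogate pairs) are looked up as a single value.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            System.out.printf("U+%04X %s%n", cp, classify(cp));
        }
    }
}
```

A linear scan is enough for a handful of ranges. A fuller table would 
probably be sorted and searched with a binary search.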

* Some late-encoded scripts and rare characters, including some CJK 
characters, are placed in the SIP (Supplementary Ideographic Plane) rather 
than the BMP.
* Maybe a special property should be set if characters from multiple 
languages are detected (languages other than the English alphabet). My 
suggestion here is to let the CJK locale be the default case, since CJK text 
needs a bi-gram or other analyzer for better indexing.
** CJK characters are very difficult to divide further because the languages 
share Han characters. If you really want to identify the specific member of 
CJK, you need to use the language identifier plugin.
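
The points above could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not an existing Nutch API. The class 
ScriptHint and its fields multiScript and cjk are hypothetical names. It uses 
java.lang.Character.UnicodeScript (Java 7 or later) instead of a hand-written 
range table, and it treats Han, Hiragana, Katakana and Hangul as a single 
"cjk" group without trying to tell Chinese, Japanese and Korean apart.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ScriptHint {
    // Scripts collapsed into one coarse "cjk" group (an assumption of this sketch).
    private static final Set<UnicodeScript> CJK = EnumSet.of(
        UnicodeScript.HAN, UnicodeScript.HIRAGANA,
        UnicodeScript.KATAKANA, UnicodeScript.HANGUL);

    // Latin letters and script-neutral characters do not count as evidence.
    private static final Set<UnicodeScript> IGNORED = EnumSet.of(
        UnicodeScript.LATIN, UnicodeScript.COMMON, UnicodeScript.INHERITED);

    /** True if letters from more than one non-Latin script group were seen. */
    final boolean multiScript;
    /** True if any CJK letter was seen; callers may treat this as the default case. */
    final boolean cjk;

    private ScriptHint(boolean multiScript, boolean cjk) {
        this.multiScript = multiScript;
        this.cjk = cjk;
    }

    static ScriptHint of(String text) {
        Set<String> groups = new HashSet<>();
        // Iterate by code point so SIP characters (surrogate pairs) are handled.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            UnicodeScript script = UnicodeScript.of(cp);
            if (IGNORED.contains(script)) {
                continue;
            }
            groups.add(CJK.contains(script)
                ? "cjk"
                : script.name().toLowerCase(Locale.ROOT));
        }
        return new ScriptHint(groups.size() > 1, groups.contains("cjk"));
    }
}
```

A caller could pick a bi-gram analyzer whenever cjk is true, even if 
multiScript is also set, and hand the text to the existing language 
identifier plugin only when the specific CJK language matters.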

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
