Another Language Identifier Plugin using Unicode code point range
-----------------------------------------------------------------
Key: NUTCH-619
URL: https://issues.apache.org/jira/browse/NUTCH-619
Project: Nutch
Issue Type: Wish
Reporter: Vinci

After checking the language-identifier plugin, I found that its internal implementation is inefficient for languages that can be clearly identified from their Unicode code points (e.g. CJK languages).

If Nutch works with Unicode internally, could somebody write a language identifier based on Unicode code point ranges? The map is here: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
You can also refer to NutchAnalysis.jj for some of the language code ranges.

* Some recently encoded or rare characters, including some CJK characters, are located in the SIP (Supplementary Ideographic Plane) rather than the BMP.
* Maybe a special property should be set if characters from multiple languages are detected (languages that use something other than the English alphabet). My suggestion here is to let the CJK locale be the default case, since those languages need a bi-gram or other analyzer for better indexing.
** CJK characters are very difficult to divide further, because the languages share Han characters. If you really want to identify the specific member of CJK, you need to use the language identifier plugin.
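To make the request concrete, here is a minimal sketch of what code-point-based detection could look like. It is not part of Nutch: the class name `CodePointScriptGuesser`, the method `guess`, and the returned labels are all hypothetical, and it relies on the JDK's `Character.UnicodeScript` (Java 7 and later) instead of hand-written ranges taken from NutchAnalysis.jj. It ignores Latin letters, treats any CJK script as the default case when scripts are mixed, and does not try to tell the CJK languages apart.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not an existing Nutch class.
final class CodePointScriptGuesser {

    // Scripts that share Han characters or usually appear alongside them.
    private static final Set<UnicodeScript> CJK = EnumSet.of(
            UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA,
            UnicodeScript.HANGUL, UnicodeScript.BOPOMOFO);

    // Scripts that carry no signal here: the English alphabet and shared symbols.
    private static final Set<UnicodeScript> IGNORED = EnumSet.of(
            UnicodeScript.LATIN, UnicodeScript.COMMON,
            UnicodeScript.INHERITED, UnicodeScript.UNKNOWN);

    /** Returns a coarse label (assumed names): "cjk", a script name, "mixed" or "latin-or-none". */
    static String guess(String text) {
        EnumSet<UnicodeScript> seen = EnumSet.noneOf(UnicodeScript.class);
        // Walk by code point, not by char, so SIP characters (surrogate pairs) count once.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            UnicodeScript script = UnicodeScript.of(cp);
            if (!IGNORED.contains(script)) {
                seen.add(script);
            }
        }
        if (seen.isEmpty()) {
            return "latin-or-none";
        }
        // CJK is the default case whenever any CJK script is present.
        for (UnicodeScript script : seen) {
            if (CJK.contains(script)) {
                return "cjk";
            }
        }
        if (seen.size() == 1) {
            return seen.iterator().next().name().toLowerCase(Locale.ROOT);
        }
        return "mixed";
    }
}
```

A caller could use the "cjk" result to select a bi-gram analyzer and fall back to the existing language-identifier plugin only when the specific CJK language matters, or when the result is "latin-or-none" or "mixed".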