[jira] [Commented] (NUTCH-619) Another Language Identifier Plugin using Unicode code point range

Lewis John McGibbney (JIRA) Tue, 09 Aug 2011 08:24:51 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081699#comment-13081699
 ]


Lewis John McGibbney commented on NUTCH-619:
--------------------------------------------

If language identification is delegated to Apache Tika, will all of the above 
points be considered and addressed?

Admittedly, Apache Tika is still evolving (and this issue quite clearly is 
not). Even so, I think the points above about linguistic properties should be 
considered in any language identification process.

If, on the other hand, we can confirm that the above points will be addressed, 
then I suggest we close this issue and note that it has been superseded by 
NUTCH-1075.

> Another Language Identifier Plugin using Unicode code point range
> -----------------------------------------------------------------
>
>                 Key: NUTCH-619
>                 URL: https://issues.apache.org/jira/browse/NUTCH-619
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Vinci
>
> After checking the language-identifier plugin, I found that its internal 
> implementation is inefficient for languages that can be clearly identified 
> from their Unicode code points (e.g. CJK languages).
> If Nutch works in Unicode, can anybody write a language identifier based on 
> Unicode code point ranges? The map is here:
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> You can also refer to NutchAnalysis.jj for some of the language code ranges.
> * Some late-encoded or rare characters, including some CJK characters, are 
> placed in the SIP (Supplementary Ideographic Plane).
> * Maybe a special property should be set if characters from multiple 
> languages are detected (languages other than those written in the English 
> alphabet). My suggestion is to make the CJK locale the default case, since 
> CJK text needs a bi-gram or other analyzer for better indexing.
> ** CJK characters are very difficult to divide further because the 
> languages share Han characters. If you really want to identify the specific 
> member of CJK, you need to use the language identifier plugin.
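
For illustration only, the code-point approach described in the issue could 
be sketched roughly as below. This is not code from Nutch or Tika: the class 
and method names (ScriptSniffer, dominantScript, containsCjk) are 
hypothetical, and the sketch relies on the JDK's Character.UnicodeScript 
(Java 8 or later assumed) instead of hand-written ranges copied from the BMP 
table. It walks the text by code point so that characters in the SIP are 
handled as well.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Tika.
final class ScriptSniffer {

    /** Most frequent Unicode script among the letters of text, or UNKNOWN if there are none. */
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        // Iterate by code point (not by char) so supplementary-plane characters stay whole.
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));

        Character.UnicodeScript best = Character.UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** True if any code point belongs to a script used for Chinese, Japanese or Korean. */
    static boolean containsCjk(String text) {
        return text.codePoints()
                   .anyMatch(cp -> isCjk(Character.UnicodeScript.of(cp)));
    }

    private static boolean isCjk(Character.UnicodeScript script) {
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }
}
```

A caller could use containsCjk to decide whether a bi-gram analyzer is 
needed, and dominantScript as a cheap first guess. As the issue notes, a HAN 
result alone does not say whether the text is Chinese or Japanese, so a 
statistical identifier would still be needed for that step.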

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
