[
https://issues.apache.org/jira/browse/NUTCH-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081699#comment-13081699
]
Lewis John McGibbney commented on NUTCH-619:
--------------------------------------------
If language identification is delegated to Apache Tika, will all of the above
points be considered and addressed?
Understandably, Apache Tika is still evolving (and this issue quite clearly is
not). However, I suppose the points made above about linguistic properties
should be considered within any language identification process.
If, on the other hand, we can confirm that the above points will be addressed,
then I suggest we close this issue and note that it has been superseded by
NUTCH-1075.
> Another Language Identifier Plugin using Unicode code point range
> -----------------------------------------------------------------
>
> Key: NUTCH-619
> URL: https://issues.apache.org/jira/browse/NUTCH-619
> Project: Nutch
> Issue Type: Wish
> Reporter: Vinci
>
> After checking the language-identifier plugin, I found that its internal
> implementation is inefficient for languages that can be clearly identified
> from their Unicode code points (e.g. CJK languages).
> If Nutch works with Unicode, could somebody write a language identifier based
> on Unicode code point ranges? The map is here:
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> You can also refer to NutchAnalysis.jj for some of the language code ranges.
> * Some recently added or rare characters, including some CJK characters,
> were placed in the SIP (Supplementary Ideographic Plane)
> * Maybe a special property should be set if characters from multiple
> languages are detected (languages written in something other than the
> English alphabet). My suggestion here is to let the CJK locale be the default
> case, as these languages need a bi-gram or other analyzer for better indexing
> ** CJK characters are very difficult to divide further because the languages
> share Han characters. If you really want to identify the specific member of
> CJK, you need to use the language identifier plugin
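
To make the request in the description concrete, here is a minimal sketch of a
code-point-based check. It is not taken from Nutch or Tika. The class name
ScriptRangeSketch, the method guess, and the labels "ja", "ko", "cjk" and
"other" are all hypothetical. It assumes Java 7 or later, relies on the JDK's
Character.UnicodeScript rather than hand-written range tables, and walks the
text by code point so that characters in the SIP are handled too.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical sketch; not part of Nutch or Tika.
final class ScriptRangeSketch {

    /**
     * Returns a coarse guess based only on the scripts present in the text.
     * "cjk" means Han characters were found but no kana or Hangul, so the
     * specific language cannot be decided from code points alone.
     */
    static String guess(String text) {
        int han = 0, kana = 0, hangul = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // handles surrogate pairs (SIP)
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;                   // skip digits, punctuation, spaces
            }
            switch (UnicodeScript.of(cp)) {
                case HAN:      han++;    break;
                case HIRAGANA:
                case KATAKANA: kana++;   break;
                case HANGUL:   hangul++; break;
                default:                 break;
            }
        }
        if (kana > 0)   return "ja";   // kana is specific to Japanese
        if (hangul > 0) return "ko";   // Hangul is specific to Korean
        if (han > 0)    return "cjk";  // Han only: ambiguous, needs more analysis
        return "other";                // leave to a statistical identifier
    }

    public static void main(String[] args) {
        for (String s : new String[] {"こんにちは", "한국어", "中文", "hello"}) {
            System.out.println(s + " -> " + guess(s));
        }
    }
}
```

In this sketch, mixed text such as Latin plus Han falls into the "cjk" case,
which follows the suggestion above that CJK be the default when several
scripts are detected. Han-only text would still need the existing language
identifier plugin (or Tika) to pick the specific language.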
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira