DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT <http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18933>. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE.
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18933

Add support for Chinese, Japanese, and Korean to the core build.

------- Additional Comments From [EMAIL PROTECTED] 2003-09-29 06:57 -------

I came across this bug purely by chance, and until a moment ago I did not know about Lucene, so some or all of the following may not be relevant; I apologize in advance if so. To read my comment, set your browser's character encoding to UTF-8, because it includes some Korean characters in UTF-8.

Korean is NOT like Chinese and Japanese. (Modern) Korean texts do use spaces between words. However, the Korean orthographic standard is rather 'liberal' in *allowing* multiple _nouns_ to be written together without spaces when they refer to a single 'entity'/'concept' (the norm is to put spaces between nouns). Therefore, Korean texts are full of 'megawords' a la German compound words. For instance, in German, 'quantum mechanics' is 'Quantenmechanik'. In Korean, it's either '양자 역학' (the norm, with a space: English-like) or '양자역학' (more widely used: German-like).

The following comment may be off-topic here. What's more relevant to a Korean tokenizer (and to a Japanese tokenizer as well, because both are agglutinative languages, whereas Chinese is an isolating language) is the ability to split word stems from the prefixes/suffixes that play various grammatical roles (tense, honorific form, mood, and so forth) and from particles (denoting subject, object, etc.). In many applications, grammatically functional prefixes/suffixes/particles/words have to be excluded from indexing because they are not 'content-bearing'. Basis Technology's Korean analyzer (www.basistech.com) is quite good (not perfect) at this.
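
To make the two points in the comment above concrete, here is a rough sketch in Python of what such a tokenizer would have to do: strip a trailing particle from each space-delimited token, then split a compound noun into its parts. Everything in it is an assumption made for illustration: the particle list and the noun lexicon are tiny hypothetical samples, the function names are invented, and this is not how Lucene or any existing Korean analyzer works. A real analyzer would rely on a morphological dictionary rather than plain suffix matching.

```python
# Toy illustration only. PARTICLES and NOUNS are small hypothetical samples,
# not a complete description of Korean grammar or vocabulary.
PARTICLES = ("에서", "으로", "은", "는", "이", "가", "을", "를",
             "의", "에", "로", "도", "와", "과")
NOUNS = {"양자", "역학"}  # "quantum", "mechanics"


def strip_particle(token):
    """Remove one trailing particle, trying longer particles first."""
    for particle in sorted(PARTICLES, key=len, reverse=True):
        # Keep at least one character of the stem.
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return token


def split_compound(word, lexicon=NOUNS):
    """Split a word into known nouns by greedy longest match.

    Returns [word] unchanged if any part of it is not in the lexicon.
    """
    parts = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                parts.append(word[i:j])
                i = j
                break
        else:
            # No known noun starts at position i: give up on splitting.
            return [word]
    return parts


def index_terms(text):
    """Whitespace-tokenize, strip particles, then split compounds."""
    terms = []
    for token in text.split():
        terms.extend(split_compound(strip_particle(token)))
    return terms


if __name__ == "__main__":
    # Both spellings of "quantum mechanics" yield the same index terms.
    print(index_terms("양자역학을 공부한다"))
    print(index_terms("양자 역학"))
```

With this sketch, the spaced and unspaced spellings of 'quantum mechanics' produce the same terms, which is the behaviour a search index would want. The suffix matching is deliberately naive and will over-strip: a noun that merely ends in a particle-like syllable, such as '고양이' (cat), loses its last character. That is exactly why dictionary-based morphological analysis is needed in practice.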