I changed the Subject.

Justin Mason wrote:
> I'm not sure I understand why.
>
> Currently, Bayes is the only code that actually *uses* knowledge of how a
> string is tokenized into words; this isn't exposed to the rules at all.
>
> If it should be, that's an entirely separate feature request. ;)
Thanks to Justin. This is an important suggestion for me.

Yes, what I am describing is not only a "charset" normalization issue. It should be called an "i18n and l10n" issue, and charset normalization is part of it. This is why I proposed and insist on my splitter function (the name of this function may not be appropriate). John's proposal and patch are a great first step toward i18n for me; they could solve our daily headache and frustration.

However, there are language-specific "normalization" issues, as I explained in previous messages. They might be called l10n issues, in short. The Japanese language permits a word to be split by a line feed, and there are no spaces between words. There are also "aliases": a zenkaku (full-width) character and a hankaku (half-width) character can share the same glyph. These features differ from Western languages, and special handling is necessary not only before Bayes tokenization but also before body/header tests (a rough sketch of what I mean follows below the signature).

We Japanese localize some applications, such as browsers, word processors, and spreadsheets, and the Japanese versions are maintained separately because they are mainly used by ourselves. However, e-mail is not bound to one country: we receive English, Chinese, Hangul, and Japanese spam daily. This is why I think I should raise the l10n issue on SA's dev list. I only know the Japanese-specific issues and am not sure what specific issues exist in other languages, so what I could implement is the Japanese part only.

Based on these considerations, the "splitter" should be renamed to something more comprehensive and intuitive, and it should receive TextCat's result and the original charset information. That way, language-specific processing for languages other than Japanese can also be written.

----------------------------------------------------------------------
Motoharu Kubo
[EMAIL PROTECTED]
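P.S. For illustration only, here is a minimal sketch in Python (not SpamAssassin code; the function name, character ranges, and use of NFKC folding are just my own assumptions) of the kind of language-specific pre-processing I mean, applied before Bayes tokenization and body/header tests:

import re
import unicodedata

# Hiragana, Katakana, and common CJK ideographs (illustrative subset).
_JA_CHAR = r'[\u3040-\u30FF\u4E00-\u9FFF]'
_LF_IN_WORD = re.compile(rf'({_JA_CHAR})\n({_JA_CHAR})')

def normalize_japanese(text: str) -> str:
    """Fold zenkaku/hankaku aliases and rejoin words broken by a line feed."""
    # NFKC folds zenkaku (full-width) and hankaku (half-width) aliases
    # of the same glyph onto one canonical form.
    text = unicodedata.normalize('NFKC', text)
    # Japanese has no spaces between words, so a line feed between two
    # Japanese characters usually splits a word; drop it.
    text = _LF_IN_WORD.sub(r'\1\2', text)
    return text

# Example: zenkaku "SPAM" plus a word broken across a line feed.
print(normalize_japanese('ＳＰＡＭ対\n策のご案内'))   # -> SPAM対策のご案内

A real splitter would of course have to work on the decoded message text and, as I wrote above, should also receive TextCat's result and the original charset so that other languages can add their own handling.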
