I changed the Subject.

Justin Mason wrote:
> I'm not sure I understand why.
>
> Currently, Bayes is the only code that actually *uses* knowledge of how a
> string is tokenized into words; this isn't exposed to the rules at all.
>
> If it should be, that's an entirely separate feature request. ;)
Thanks to Justin. This is an important suggestion for me.

Yes, what I am describing is not only a "charset" normalization issue. It should be called an "i18n and l10n" issue, and charset normalization is part of it. This is why I proposed and insist on my splitter function (the name of this function may not be appropriate). John's proposal and patch are a great first step toward i18n for me; they could solve our daily headache and frustration.

However, there are language-specific "normalization" issues, as I explained in previous messages. They might be called l10n issues, in short. The Japanese language permits a word to be split by a line feed, and there are no spaces between words. There are also "aliases": a zenkaku (full-width) character and a hankaku (half-width) character can share the same glyph. These features differ from Western languages, and special handling is necessary not only before Bayes tokenization but also before body/header tests (a rough sketch of what I mean follows below the signature).

We Japanese localize some applications, such as browsers, word processors, and spreadsheets, and the Japanese versions are maintained separately because they are mainly used by ourselves. However, e-mail is not bound to one country: we receive English, Chinese, Hangul, and Japanese spam daily. This is why I think I should raise the l10n issue on SA's dev list. I only know the Japanese-specific issues and am not sure what specific issues exist in other languages, so what I could implement is the Japanese part only.

Based on these considerations, the "splitter" should be renamed to something more comprehensive and intuitive, and it should receive TextCat's result and the original charset information. That way, language-specific processing for languages other than Japanese can also be written.

----------------------------------------------------------------------
Motoharu Kubo
[EMAIL PROTECTED]
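P.S. For illustration only, here is a minimal sketch in Python (not SpamAssassin code; the function name, character ranges, and use of NFKC folding are just my own assumptions) of the kind of language-specific pre-processing I mean, applied before Bayes tokenization and body/header tests:

import re
import unicodedata

# Hiragana, Katakana, and common CJK ideographs (illustrative subset).
_JA_CHAR = r'[\u3040-\u30FF\u4E00-\u9FFF]'
_LF_IN_WORD = re.compile(rf'({_JA_CHAR})\n({_JA_CHAR})')

def normalize_japanese(text: str) -> str:
    """Fold zenkaku/hankaku aliases and rejoin words broken by a line feed."""
    # NFKC folds zenkaku (full-width) and hankaku (half-width) aliases
    # of the same glyph onto one canonical form.
    text = unicodedata.normalize('NFKC', text)
    # Japanese has no spaces between words, so a line feed between two
    # Japanese characters usually splits a word; drop it.
    text = _LF_IN_WORD.sub(r'\1\2', text)
    return text

# Example: zenkaku "SPAM" plus a word broken across a line feed.
print(normalize_japanese('ＳＰＡＭ対\n策のご案内'))   # -> SPAM対策のご案内

A real splitter would of course have to work on the decoded message text and, as I wrote above, should also receive TextCat's result and the original charset so that other languages can add their own handling.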
