Hello.

From: Motoharu Kubo <[EMAIL PROTECTED]>
Subject: Re: I18n and l10n
Date: Tue, 17 Jan 2006 02:22:24 +0900
> MATSUDA Yoh-ichi wrote:
> > Is the above flow drawing correct or wrong?
> > And, John-san and Motoharu-san's patches are:
> >
> >      |                                                  |
> >      +--------------------------------------------------+
> >      |              (NEW!)                              V
> >      +-> converting html -> UTF-8 character   -> [body]->+
> >          to plain text     normalization                |
> >      +--------------------------------------------------+
> >      |              (NEW!)
> >      +-> tokenization -> [bayes]
> >          by Mecab
>
> My opinion is that we should tokenize just after charset normalization:
>
>     UTF-8 character -> tokenization -> [body]
>     normalization
>
> I have written the reasons why I insist on this flow several times.
> In short:
>
> (1) to join words separated by a line break (e.g. "a\nb" becomes "ab"
>     if "ab" is the word)
> (2) to clarify word boundaries (e.g. "youwon" -> "you won")
>
> > Many Japanese spams are written in the Shift-JIS codeset.
> > A Shift-JIS detecting rule would be convenient.
>
> My opinion is yes and no.
>
> - There are many SJIS spams, but also many iso-2022-jp encoded spams.

Yes.

> - Not all SJIS mails are spam. A careless alert mail sent from a
>   Windows application is also SJIS encoded (without base64 or
>   quoted-printable encoding).

Yes.

> - There might be some tendency or difference between SJIS spam and
>   iso-2022-jp spam, but it is not so significant, I think.

I think that is yes and no. Not all SJIS mails are spam, but their spam
probability is high. For example, consider a received mail characterized
by:

(a) SJIS encoded
(b) came from Brazil, Mexico, Russia, Romania, ...
(c) dynamic address
(d) Razor2 registered
(e) BAYES_99

With (a) alone, we can't tell whether the mail is spam or ham. But with
(a) and (b), the spam probability is higher than with (a) alone. With
(a), (b) and (c) it is higher than with (a) and (b); with (a), (b), (c)
and (d) it is higher still, and so on.

SA has meta rules for exactly this situation. All rules express
probabilities, and SA combines them into an overall score.

> - Writing rules in hex notation is troublesome, boring, and decreases
>   productivity.
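To illustrate that last point, here is a hypothetical pair of local rules matching the same Japanese word ("spam", スパム): the rule names and scores are invented, and the byte sequence is the word's Shift-JIS encoding.

```
# Against the raw body, the word must be spelled out in bytes,
# which is hard to write and impossible to read:
rawbody  LOCAL_JP_WORD_HEX  /\x83X\x83p\x83\x80/
score    LOCAL_JP_WORD_HEX  1.0

# Against a UTF-8-normalized body, the same rule could be written
# directly in a UTF-8 aware editor:
body     LOCAL_JP_WORD      /スパム/
score    LOCAL_JP_WORD      1.0
```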
>   If we could normalize the charset, we could write rules directly
>   with a UTF-8 aware editor.

Yes. Writing REGEX rules directly with UTF-8 characters is very
convenient.

But I think character normalization and tokenization before body testing
is troublesome. Because normalization and tokenization modify the
message text, a REGEX rule writer can no longer match against the
original text. Many rules are written for the pure plain message text;
if character normalization and tokenization are inserted before body
testing, many body rules will stop working. So,

> > But, if character normalization is inserted before body testing,
> > my rules will become unavailable.
> >
> > Do I have to re-write the above 2 rules from [body] to [rawbody]?
>
> There are two possibilities.
>
> (1) rewrite from BODY to RAWBODY, as Matsuda-san says.
> (2) invent NBODY (or something else) apart from BODY. NBODY would
>     contain the normalized and tokenized version of the body. I once
>     thought of this idea but did not propose it, because BODY has the
>     problems I mentioned above and the overhead of executing nbody
>     tests increases.

I want (2), for the sake of rule compatibility.

-- 
MATSUDA Yoh-ichi(yoh)
mailto:[EMAIL PROTECTED]
http://www.flcl.org/~yoh/diary/
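P.S. The preprocessing order argued for above (charset normalization first, then rejoining words broken across line breaks) could be sketched roughly as the Python below. The function names are invented for illustration, the CJK character ranges are an approximation, and a real implementation would follow this with MeCab for the actual tokenization step.

```python
import re

# Rough CJK ranges: hiragana, katakana, and the main CJK ideograph block.
CJK = r'\u3040-\u30ff\u4e00-\u9fff'

def normalize_to_utf8(raw: bytes, declared_charset: str = 'shift_jis') -> str:
    """Charset normalization: decode a message body to a Unicode string."""
    try:
        return raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Fall back rather than dropping the message entirely.
        return raw.decode('utf-8', errors='replace')

def join_cjk_linebreaks(text: str) -> str:
    """Remove a line break between two CJK characters, so that a word
    split across lines (reason (1) above) is rejoined into one token."""
    return re.sub(r'(?<=[{0}])\n(?=[{0}])'.format(CJK), '', text)

# "spam" (katakana) split across a line break, encoded as Shift-JIS:
raw = 'スパ\nム'.encode('shift_jis')
text = join_cjk_linebreaks(normalize_to_utf8(raw))
print(text)  # -> スパム
```

Note that the line break is removed, not replaced with a space: Japanese text wraps without spaces, so inserting one would split the word instead of rejoining it.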
