Re: Charset normalization issue (report, patch, and request)

MATSUDA Yoh-ichi / 松田陽一 Mon, 16 Jan 2006 07:37:36 -0800

Hello.

From: Motoharu Kubo <[EMAIL PROTECTED]>
Subject: Re: Charset normalization issue (report, patch, and request)
Date: Sun, 15 Jan 2006 13:21:15 +0900


> MATSUDA Yoh-ichi wrote:
> > Spammers' word obfuscation techniques are not limited to splitting
> > words with LF.
> > 'o' -> '0', 'i' -> '1', 'l' -> '|', 'a' -> '@', and many more...
> > Tokenization doesn't fit these techniques.
> 
> Just an idea.  If there were good proofreading software, we could detect 
> this kind of obfuscation universally in splitter().  Then we could tell 
> the test rules that obfuscation was detected by inserting a special mark 
> or by some other means.

But, for example, some domain names look like obfuscated words.
Not all mail text is written only in natural-language words.
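
A rough sketch of the problem, in Python. The substitution table comes from
the examples quoted above; the function names are made up for this mail and
are not SpamAssassin code. A naive detector along these lines would flag
host names and mail addresses as well as obfuscated words:

```python
# Hypothetical substitution table, built from the examples quoted above.
LEET = str.maketrans({"0": "o", "1": "i", "|": "l", "@": "a"})


def deobfuscate(token: str) -> str:
    """Map the substitute characters back to the letters they imitate."""
    return token.translate(LEET)


def looks_obfuscated(token: str) -> bool:
    """Naive check: the token has a letter and at least one substitute character."""
    has_letter = any(c.isalpha() for c in token)
    has_substitute = token != deobfuscate(token)
    return has_letter and has_substitute


if __name__ == "__main__":
    # An obfuscated word, a host name, a mail address, and a plain word.
    for token in ["v1agra", "web01.example.com", "user@example.com", "hello"]:
        print(token, looks_obfuscated(token), deobfuscate(token))
```

The host name and the address are flagged just like "v1agra", so a special
mark inserted on this basis would also be attached to ordinary mail.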

I don't think regex detection fits word obfuscation tricks.
That is an area for Bayes.
--
Japanese spam EXPO :-p
http://www.flcl.org/~yoh/spam/jp/
MATSUDA Yoh-ichi(yoh)
mailto:[EMAIL PROTECTED]
http://www.flcl.org/~yoh/diary/ (only Japanese)
