Re: Charset normalization issue (report, patch, and request)

Justin Mason Sun, 15 Jan 2006 15:40:36 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


"Loren Wilton" writes:
> > Currently, Bayes is the only code that actually *uses* knowledge of how a
> > string is tokenized into words; this isn't exposed to the rules at all.
> 
> This isn't even slightly true!  Virtually every rule written against English
> spam is in some way concerned with word breaks.  In some cases in
> obfuscation rules the rule may be concerned with ignoring word breaks.  In
> many cases like /you have already won!/i there are implicit word breaks in
> the rule.  Other rules use \b to require word breaks and prevent erroneous
> matches.  If breaks were completely arbitrary, the language would be nigh
> unto unreadable, and virtually all existing rules would fail!

You're misunderstanding me.

Of course the people who write rules are concerned with where the word
breaks land.  However, the rule-type code doesn't have any knowledge
of word breaks; it's just matching a string of text against a regexp.
Bayes is the only rule-type code that does.
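
To make the distinction concrete, here's a rough sketch in Python.  It is
not SpamAssassin's actual code: the function names are made up, and the
whitespace split is only an assumed stand-in for whatever tokenizer Bayes
really uses.  The point is just that the regexp path sees one opaque
string, while the Bayes-style path has to decide where the words are.

```python
import re

def regex_rule_hits(pattern, text):
    # Hypothetical regexp rule: the code only sees one string of text.
    # Any notion of word breaks (spaces, \b) lives in the pattern the
    # rule writer supplied, not in the matching code itself.
    return re.search(pattern, text, re.IGNORECASE) is not None

def bayes_style_tokens(text):
    # Hypothetical Bayes-style tokenizer: the code itself must decide
    # where words begin and end.  Splitting on whitespace is an
    # assumption made purely for illustration.
    return [tok.lower() for tok in text.split()]

text = "You have already won!"
print(regex_rule_hits(r"you have already won!", text))  # True
print(bayes_style_tokens(text))  # ['you', 'have', 'already', 'won!']
```

If the tokenizer changes (say, for a charset with no spaces between
words), only the second function's behaviour changes; the first keeps
matching whatever the pattern says.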

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDyt1PMJF5cimLx9ARAt2vAKCaQ9ehZ7VBsIN6lk0pgQrQ/epDKQCgvde5
T+F+m6iccEfkcpt+8jWXY+k=
=sCX3
-----END PGP SIGNATURE-----
