http://bugzilla.spamassassin.org/show_bug.cgi?id=4052

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|[EMAIL PROTECTED]        |



------- Additional Comments From [EMAIL PROTECTED]  2005-01-09 23:31 -------
> 1. In our experience, patterns which span 4 or more words, are often more
> effective at catching a small set of spam, but with very low false positive
> rates, than patterns which match only 1 or 2 words.

> Have you tried modifying the generator so that it generates longer patterns
> from the corpus?

We have finished the experiments on this issue. Versions of Chinese_rules.cf
generated with different pattern lengths are available at the following links:

(Note: each Chinese character is encoded as 2 bytes.)


1. Subject and Body patterns are about 4 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_4_4

2. Subject patterns are about 4 bytes; Body patterns are about 6 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_4_6

3. Subject patterns are about 6 bytes; Body patterns are about 8 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_6_8

4. Subject patterns are about 8 bytes; Body patterns are about 10 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_8_10

In our experience, subject patterns spanning 4 or more bytes and body patterns
spanning 6 or more bytes are a good choice.
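
To make the byte lengths above concrete, here is a minimal sketch of how
fixed-byte-length patterns could be cut from a piece of text. It is only an
illustration, not our generator: the function name byte_patterns is
hypothetical, and the GB2312 encoding is an assumption (any encoding that uses
2 bytes per Chinese character behaves the same way).

```python
def byte_patterns(text, n_bytes, encoding="gb2312"):
    """Return every run of consecutive characters whose encoded
    length is exactly n_bytes (hypothetical helper, for illustration)."""
    patterns = []
    for start in range(len(text)):
        size = 0
        for end in range(start, len(text)):
            # Chinese characters add 2 bytes each in GB2312, ASCII adds 1.
            size += len(text[end].encode(encoding))
            if size >= n_bytes:
                if size == n_bytes:
                    patterns.append(text[start:end + 1])
                break
    return patterns


# A 4-byte pattern covers 2 Chinese characters, a 6-byte pattern covers 3.
print(byte_patterns("中文测试", 4))  # ['中文', '文测', '测试']
print(byte_patterns("中文测试", 6))  # ['中文测', '文测试']
```

So "about 4 bytes" means roughly 2 Chinese characters per pattern, and "about
10 bytes" means roughly 5.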

We have updated our generator (thanks to Justin for the suggestion), and I
think the coming versions of Chinese_rules.cf will reach the following
recall/error rates:

# Test against 20322 spam and 99689 ham
# (using only the Chinese_rules.cf)
#
#       Threshold       Spam recall     Ham error
#       0.5             92.8%           1.2%
#       1.0             90.6%           0.5%
#       1.5             89.0%           0.3%
#       2.0             86.7%           0.1%
#       2.5             84.6%           0.1%
#       3.0             82.3%           0.0%
#       3.5             80.2%           0.0%
#       4.0             78.3%           0.0%
#       4.5             76.5%           0.0%
#
# It takes 0.03 seconds to scan an email with size 2013.54 bytes (P4-2.8G CPU)
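
For readers who want to check how the two columns relate to a threshold, here
is a small sketch of the calculation. It assumes a message is flagged as spam
when its total score reaches the threshold; the function name
rates_at_threshold is hypothetical and the scores are toy values chosen for
illustration, not taken from our corpus.

```python
def rates_at_threshold(spam_scores, ham_scores, threshold):
    """Return (spam recall, ham error) as fractions for one threshold.

    Assumption: a message counts as flagged when score >= threshold.
    """
    caught = sum(1 for s in spam_scores if s >= threshold)
    false_pos = sum(1 for s in ham_scores if s >= threshold)
    return caught / len(spam_scores), false_pos / len(ham_scores)


# Toy scores only, to show the trade-off between the two rates.
spam_scores = [0.6, 1.2, 2.5, 4.0]
ham_scores = [0.0, 0.0, 0.7, 0.1]

for threshold in (0.5, 2.0):
    recall, error = rates_at_threshold(spam_scores, ham_scores, threshold)
    print(f"{threshold:.1f}  recall={recall:.1%}  ham error={error:.1%}")
```

Raising the threshold lowers both numbers, which is the pattern in the table
above.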

Best,
Tran



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
