http://bugzilla.spamassassin.org/show_bug.cgi?id=4052
[EMAIL PROTECTED] changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|[EMAIL PROTECTED]           |
------- Additional Comments From [EMAIL PROTECTED] 2005-01-09 23:31 -------
> 1. In our experience, patterns which span 4 or more words, are often more
> effective at catching a small set of spam, but with very low false positive
> rates, than patterns which match only 1 or 2 words.
> Have you tried modifying the generator so that it generates longer patterns
> from the corpus?
We have finished the experiments on this issue. Versions of Chinese_rules.cf
generated with different pattern lengths are available at the following links:
(Note: a Chinese character is encoded as 2 bytes.)
1. Subject and Body patterns are about 4 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_4_4
2. Subject patterns are about 4 bytes; Body patterns are about 6 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_4_6
3. Subject patterns are about 6 bytes; Body patterns are about 8 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_6_8
4. Subject patterns are about 8 bytes; Body patterns are about 10 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_8_10
In our experience, Subject patterns spanning 4 or more bytes and Body patterns
spanning 6 or more bytes are a good choice.
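
To make the byte lengths above concrete, here is a minimal sketch of how
fixed-length candidate patterns could be collected from a corpus. It is not
our generator: the function names byte_ngrams and count_patterns are
hypothetical, GB2312 is an assumed encoding (any 2-byte encoding would behave
the same way), and the sample messages are made up for illustration.

```python
from collections import Counter

def byte_ngrams(text, n_bytes, encoding="gb2312"):
    """Yield each run of consecutive double-byte characters that encodes
    to exactly n_bytes. The encoding is an assumption; n_bytes is expected
    to be even, since each Chinese character takes 2 bytes."""
    n_chars = n_bytes // 2
    for i in range(len(text) - n_chars + 1):
        window = text[i:i + n_chars]
        try:
            raw = window.encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in the assumed encoding
        # Windows containing single-byte (ASCII) characters come out shorter
        # than n_bytes and are skipped.
        if len(raw) == n_bytes:
            yield raw

def count_patterns(messages, n_bytes):
    """Count in how many messages each candidate pattern occurs."""
    counts = Counter()
    for msg in messages:
        counts.update(set(byte_ngrams(msg, n_bytes)))
    return counts

# Made-up sample messages, for illustration only.
messages = ["免费发票", "代开发票 abc"]
four_byte = count_patterns(messages, 4)   # 2-character patterns
six_byte = count_patterns(messages, 6)    # 3-character patterns
```

Longer patterns occur in fewer messages, which is the trade-off discussed
above: each one catches less spam but is less likely to match ham.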
We have updated our generator (thanks to Justin for the suggestion), and I
expect the coming versions of Chinese_rules.cf to reach the following
recall/error rates:
# Test against 20322 spam and 99689 ham
# (using only the Chinese_rules.cf)
#
# Threshold   Spam recall   Ham error
#    0.5         92.8%        1.2%
#    1.0         90.6%        0.5%
#    1.5         89.0%        0.3%
#    2.0         86.7%        0.1%
#    2.5         84.6%        0.1%
#    3.0         82.3%        0.0%
#    3.5         80.2%        0.0%
#    4.0         78.3%        0.0%
#    4.5         76.5%        0.0%
#
# It takes 0.03 seconds to scan an email with size 2013.54 bytes (P4-2.8G CPU)
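
For reference, a minimal sketch of how a recall/error table of this shape can
be computed from per-message scores. The function name rates_at_thresholds is
hypothetical, the rule that a message is flagged when its score is greater
than or equal to the threshold is an assumption, and the scores below are toy
values, not the corpus results reported above.

```python
def rates_at_thresholds(spam_scores, ham_scores, thresholds):
    """Return (threshold, spam recall, ham error) for each threshold.

    Assumes a message is flagged when score >= threshold.
    Spam recall = flagged spam / all spam.
    Ham error   = flagged ham / all ham.
    """
    rows = []
    for t in thresholds:
        recall = sum(s >= t for s in spam_scores) / len(spam_scores)
        ham_error = sum(s >= t for s in ham_scores) / len(ham_scores)
        rows.append((t, recall, ham_error))
    return rows

# Toy scores, for illustration only.
spam_scores = [0.4, 1.2, 3.0, 5.0]
ham_scores = [0.0, 0.0, 0.6, 0.1]
rows = rates_at_thresholds(spam_scores, ham_scores, [0.5, 1.0, 3.0])

for t, recall, ham_error in rows:
    print(f"# {t:>5.1f} {recall:>8.1%} {ham_error:>8.1%}")
```

Raising the threshold can only lower both columns, which matches the
monotonic trend in the table.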
Best,
Tran
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.