http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4787





------- Additional Comments From [EMAIL PROTECTED]  2006-07-05 15:42 -------
Created an attachment (id=3567)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3567&action=view)
proof of concept

This patch implements token tracking to prevent the case where lots of ham and
spam have been learned, but all the tokens for one class have expired, causing
bayes to lean too far one way (i.e. BAYES_00 or BAYES_99 on all mail).

It also implements ham:spam ratio restrictions, which prevent the autolearner
from learning too much ham when the ratio is high, and too much spam when the
ratio is low.
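
A minimal sketch of the ratio gate described above, in Python rather than the
patch's Perl. The function and parameter names are hypothetical and are not
taken from the attachment; the handling of an empty store is an assumption.

```python
def ham_spam_ratio(ham_tokens, spam_tokens):
    """Return the ham:spam token ratio, or None if the store is empty."""
    if spam_tokens == 0:
        # Assumption: no spam tokens at all counts as an unbounded ratio,
        # unless there are no ham tokens either.
        return float("inf") if ham_tokens > 0 else None
    return ham_tokens / spam_tokens


def may_autolearn(is_spam, ham_tokens, spam_tokens,
                  min_ratio=0.5, max_ratio=2.0):
    """Decide whether autolearning a message keeps the store in bounds.

    Spam is skipped when the ratio is already below min_ratio (too much
    spam); ham is skipped when it is already above max_ratio (too much ham).
    """
    ratio = ham_spam_ratio(ham_tokens, spam_tokens)
    if ratio is None:
        return True  # nothing learned yet, so nothing to balance
    if is_spam:
        return ratio >= min_ratio
    return ratio <= max_ratio
```

With a ratio of 0.74 and bounds of 0.75-1.25, may_autolearn(True, 74, 100,
0.75, 1.25) is False, while the same call with is_spam=False is True.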

The proof of concept code only applies to BayesStore/SQL.pm, so in order to
test it, you'd need to be using

bayes_store_module Mail::SpamAssassin::BayesStore::SQL

Since the box I'm testing on learns a lot of spam and little ham, the token
ratio is always at the bottom end, near the min ratio.

[12005] dbg: bayes: ham:spam token ratio (0.74:1), min ratio (0.75:1), max ratio (1.25:1)
[12005] dbg: bayes: skip autolearn of spam because ham:spam token ratio (0.74) is less than min ratio (0.75)

As you can see from the autolearn results, it's skipped a bunch of spam learns
today...

# grep -c autolearn=ham spamd.log
652
# grep -c autolearn=spam spamd.log
859
# grep -c autolearn=unavailable spamd.log
5141

But that's because I've set my min/max ratios so close, at 0.75-1.25.  If you
want to learn a lot more spam, you could simply use 0.5-2.0, which is the
default... or you could even lower that 0.5 to something like 0.25 if you want
to learn up to 4x more spam than ham.
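
The arithmetic behind those bounds, as a small Python sketch. The helper name
is hypothetical, and the multiples are in terms of token counts: a minimum
ham:spam ratio of r allows up to 1/r times as many spam tokens as ham tokens,
and a maximum ratio of R allows up to R times as many ham tokens as spam.

```python
def imbalance_limits(min_ratio, max_ratio):
    """Return (max spam-to-ham multiple, max ham-to-spam multiple)."""
    if min_ratio <= 0 or max_ratio < min_ratio:
        raise ValueError("expected 0 < min_ratio <= max_ratio")
    # A ham:spam floor of min_ratio is a spam:ham ceiling of 1 / min_ratio.
    return 1.0 / min_ratio, float(max_ratio)


for lo, hi in [(0.75, 1.25), (0.5, 2.0), (0.25, 2.0)]:
    spam_x, ham_x = imbalance_limits(lo, hi)
    print(f"{lo}-{hi}: up to {spam_x:.2f}x spam, up to {ham_x:.2f}x ham")
```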

Realize that this code is not drop-in ready, as it requires a couple of SQL
alters to track spam/ham token counts.

ALTER TABLE bayes_vars ADD spam_token_count int(11) NOT NULL default '0' AFTER
token_count;
ALTER TABLE bayes_vars ADD ham_token_count int(11) NOT NULL default '0' AFTER
spam_token_count;



