[Bug 3821] scores are overoptimized for training set

bugzilla-daemon 1 Oct 2004 04:06:46 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=3821






------- Additional Comments From [EMAIL PROTECTED]  2004-09-30 21:06 -------
I agree this is a better place to discuss the philosophical question: 
> If RFCI was perfect at hitting otherwise missed spam with no FPs except
> roaringpenguin.com, and the mail used to score the perceptron had much less
> than 1 in 2500 mails from roaringpenguin.com, is it correct to let the rule
> get a very high score?

What do we mean by a "very high score"? My practice is that any rule which
hits ANY ham gets scored no higher than 1/3 of required_hits (RH/3). Rules
that hit lots of spam get that RH/3, but no higher.

Philosophically, if we had a rule which we knew /should/ hit the occasional 
ham, then I would similarly limit it, even if that theoretical ham was not in 
any testing corpus. 

RH/3 is simply my rule of thumb, because I generally deal with a limited
corpus of only 100k emails or so. IMO, if a rule were tested against
sufficiently large corpora, a cap of RH/2 wouldn't be unreasonable.
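
To make the rule of thumb concrete, here is a minimal sketch of the cap in
Python. It is not code from SpamAssassin or from the perceptron; the function
name, its parameters, and the required_hits value of 5.0 are assumptions
chosen purely for illustration.

```python
def cap_score(proposed_score, ham_hits, required_hits=5.0,
              divisor=3.0, may_hit_ham=False):
    """Apply the RH/divisor cap to a rule's proposed score.

    All names here are hypothetical. required_hits=5.0 is an assumed
    threshold; use whatever your installation is configured with.
    """
    # A rule is capped if it hit any ham in the corpus, or if we expect
    # it to hit ham occasionally even though the corpus shows none.
    if ham_hits > 0 or may_hit_ham:
        return min(proposed_score, required_hits / divisor)
    # Rules with no observed or expected ham hits keep their score.
    return proposed_score


# A rule that hit one ham is held to RH/3, however much spam it catches.
capped = cap_score(4.0, ham_hits=1)

# With a larger corpus, the looser RH/2 cap could be used instead.
looser = cap_score(4.0, ham_hits=1, divisor=2.0)
```

Passing may_hit_ham=True covers the philosophical case above: a rule that is
expected to hit the occasional ham is capped even when no such ham appears in
the test corpus.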




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
