[Bug 3821] scores are overoptimized for training set

bugzilla-daemon 27 Sep 2004 21:22:35 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=3821

------- Additional Comments From [EMAIL PROTECTED]  2004-09-27 14:21 -------
'BTW, this is the "rule reliability tflag" idea again; basically provide a way
to hint that this rule is reliable, and this rule should not be considered
reliable -- no matter what their hit-rates in mass-checks were.'

Oh, I should point out -- the point in particular here is that you can often
get rules that hit 20%:0.001% spam:ham, for a very high S/O -- these would
always be given a good high score range, and the perceptron would be allowed to
score those rules highly.

However, sometimes a really simple one-word body pattern (e.g. "viagra") may get
1.0%:0.0001% hit-rates.  Given that it's a really simple one-word body pattern,
*we* know that it has a high chance of FP'ing in the field, even if the ham in
our corpora almost never contains it -- so a reliability tflag gives us a way to
indicate this.
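
To put numbers on the two hit-rates above, here is a rough sketch in Python. It
assumes S/O is computed as spam% / (spam% + ham%) from the mass-check hit
percentages; the function name s_over_o is made up for illustration and is not
part of the SpamAssassin tools.

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Spam/overall ratio from hit percentages (assumed definition)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit anything, S/O is undefined")
    return spam_pct / total


# Hit-rates quoted in the comment, as spam%:ham%.
high_freq = s_over_o(20.0, 0.001)   # the frequently-hitting rule
low_freq = s_over_o(1.0, 0.0001)    # the one-word body pattern

print(f"high-frequency rule S/O: {high_freq:.5f}")
print(f"low-frequency rule  S/O: {low_freq:.5f}")
```

Both values come out above 0.9999, so S/O alone cannot tell the two rules
apart; whatever makes the one-word pattern risky is not visible in this number.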

OTOH, at times, we know that another similarly low-frequency rule is very
reliable and won't FP, and so can safely get a high score, but we just don't
have a lot of data that hits it in our corpora.

The current problem is that our scoring code has to be over-paranoid about
ranges for low-frequency rules -- just in case it's the first case and not the
second -- hence restricting them unfairly. 
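
As a sketch of how such a hint could feed into range selection (purely
hypothetical: the Reliability values, the max_score function, the 5% cutoff and
the score caps are all invented here and do not reflect the actual scoring
code), something along these lines:

```python
from enum import Enum


class Reliability(Enum):
    """Hypothetical per-rule hint, standing in for a reliability tflag."""
    UNKNOWN = "unknown"
    RELIABLE = "reliable"
    UNRELIABLE = "unreliable"


def max_score(spam_pct: float,
              hint: Reliability = Reliability.UNKNOWN,
              low_freq_cutoff: float = 5.0,
              full: float = 4.0,
              cautious: float = 2.0,
              distrusted: float = 1.0) -> float:
    """Upper bound of the score range the optimizer may assign to a rule.

    All thresholds and caps are illustrative assumptions.
    """
    if hint is Reliability.UNRELIABLE:
        # Known-fragile rule: cap it no matter how good its hit-rates look.
        return distrusted
    if hint is Reliability.RELIABLE:
        # Known-good rule: allow the full range even with little data.
        return full
    # No hint: stay paranoid about rules with few spam hits.
    return cautious if spam_pct < low_freq_cutoff else full


print(max_score(20.0))                          # frequent rule, no hint
print(max_score(1.0))                           # rare rule, no hint
print(max_score(1.0, Reliability.RELIABLE))     # rare rule we trust
print(max_score(1.0, Reliability.UNRELIABLE))   # rare one-word pattern
```

The point of the design is that the paranoid default only applies when nobody
has said anything about the rule; an explicit hint overrides what the corpus
frequencies alone would suggest.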



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
