[Bug 4031] bayesian scores lower for higher probability

bugzilla-daemon 15 Dec 2004 17:09:47 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=4031


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|INVALID                     |DUPLICATE



------- Additional Comments From [EMAIL PROTECTED]  2004-12-15 09:07 -------
Nick,

I did the score optimization for SpamAssassin 3.0.

> So if I understand it right, the perceptron's scores are optimised for
> a rather inaccurate Bayes database ?  Or is it that there was no spam
> that only hit Bayes, so the perceptron thought it could safely reduce
> the Bayes scores ?

The latter is more the case than the former.  We're learning scores by
minimizing an error function that is sort of like:

Err(Msg,Score,Threshold,Class) = { 0 if Class=Spam and Score>=Threshold
                                 { 0 if Class=Ham and Score<Threshold
                                 { abs(Score-Threshold) otherwise

I don't have time to give you any hard numbers right now, but one can safely
assume that messages with BAYES_99 usually have a lot of other rule hits as
well.  Because of this, most messages with BAYES_99 already have high scores and
the value of the error function will almost always be 0.
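
As a rough illustration of the error function above (a minimal sketch, not the
actual SpamAssassin perceptron code; the function name, the threshold of 5.0,
and all rule scores below are assumptions chosen for the example):

```python
def err(score, threshold, is_spam):
    """Per-message error as described above: zero when the message falls on
    the correct side of the threshold, otherwise the distance to it."""
    if is_spam and score >= threshold:
        return 0.0
    if not is_spam and score < threshold:
        return 0.0
    return abs(score - threshold)


THRESHOLD = 5.0  # assumed threshold for this example

# Hypothetical spam message: its other rule hits already sum to 7.5 points,
# so the error is zero whatever score BAYES_99 is given.
other_rules = 7.5
for bayes_99 in (0.5, 2.0, 4.0):
    total = other_rules + bayes_99
    print(bayes_99, err(total, THRESHOLD, is_spam=True))

# Hypothetical spam message that hits only BAYES_99: here the score matters.
print(err(2.0, THRESHOLD, is_spam=True))
```

In this sketch, only the last message puts any pressure on the BAYES_99 score.
If such messages are rare in the training set, nothing in the error function
pushes that score upward.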

Another reason why scores for rules like BAYES_99 are smaller than they were
before is that the URIBL rules are too "loud" in the dataset:  Because there are
so many of these rules with very high hit rates, they tend to occupy a
proportionate (or disproportionate, depending on how you look at it) amount of
"mass" of the scores.  

It is a lot easier to make an accurate perceptron with lower scores.  As the
scores are forced higher, the false positive rate goes up.  However, this method
is also vulnerable to attacks from an adversary because the scores for the most
accurate rules tend to be much larger than those of the median rules.  This is
the subject of my master's thesis (which is almost done).  If you want a more
in-depth answer, ping me for a copy of my thesis in January.

The short answer is:  For now, the scores are as high as I can make them without
SpamAssassin producing false positives all the time.

This bug is a duplicate of Bug 3821.

Henry

*** This bug has been marked as a duplicate of 3821 ***




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
