http://bugzilla.spamassassin.org/show_bug.cgi?id=3821





------- Additional Comments From [EMAIL PROTECTED]  2004-09-27 07:28 -------
Subject: Re:  scores are overoptimized for training set

Hi Matthias,

Mike Brzozowski at Stanford's AI lab has been doing experiments using 
support vector machines and logistic regression to classify messages.  
From what I've seen, their results are neither better nor worse than 
mine with the perceptron, so I doubt that changing the learning 
algorithm itself will have much effect.

The problem lies in the fact that this is an adversarial classification 
problem, which adds some constraints to the solution space.  For any 
message, M, the adversary must not be able to create a message M' that 
triggers additional rules yet has P(Spam|M') < P(Spam|M).  In the case of 
margin classifiers like perceptrons and support vector machines, this 
means that there can be no negative weights for rules that the adversary 
can affect.  Likewise, if the adversary creates a message M' that 
triggers fewer rules, P(Spam|M) should not be greatly larger than 
P(Spam|M').  This means that the weights must be as large as possible 
without causing unnecessary false positives.
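The no-negative-weights constraint can be sketched as a perceptron whose update clips adversary-controllable weights at zero, so that triggering more rules can never lower a message's score.  This is only an illustration of the idea, not SpamAssassin's actual trainer; the rule names and training data below are invented.

```python
# Toy perceptron with the adversarial constraint: rules a spammer can
# deliberately trigger must never carry negative weight, otherwise the
# spammer could add content to lower the score.  Hypothetical example.

def train_constrained(examples, adversarial_rules, epochs=100, lr=0.1):
    """examples: list of (features: dict rule -> 0/1, label: +1 spam / -1 ham)."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            score = sum(weights.get(r, 0.0) * v for r, v in features.items())
            predicted = 1 if score > 0 else -1
            if predicted != label:
                for r, v in features.items():
                    weights[r] = weights.get(r, 0.0) + lr * label * v
                    # Clamp: adversary-controllable rules stay non-negative.
                    if r in adversarial_rules and weights[r] < 0:
                        weights[r] = 0.0
    return weights

examples = [
    ({"HTML_ONLY": 1, "MANY_EXCLAMATIONS": 1}, +1),
    ({"IN_WHITELIST": 1}, -1),
    ({"HTML_ONLY": 1}, +1),
    ({}, -1),
]
w = train_constrained(examples,
                      adversarial_rules={"HTML_ONLY", "MANY_EXCLAMATIONS"})
```

Rules the adversary cannot trigger at will (e.g. a whitelist hit) are still free to take negative weights.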

The senior SpamAssassin developers wanted a false positive rate of 0.04% 
(1:2500) for each configuration at the default threshold.  To 
accomplish this, we scaled the allowable ranges for the scores and 
trained the perceptron using cross-validation to choose the set of 
parameters that best met our needs.  This is why the scores seem so low.
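The scale-selection step could be sketched roughly as follows: try several global scaling factors for the learned weights, estimate the false positive rate at the default threshold on held-out ham, and keep the largest scale (largest weights, per the adversarial argument above) whose FP rate stays at or under the target.  The numbers and data here are invented for illustration, not the actual tuning code.

```python
# Hypothetical sketch of choosing a score scale under an FP-rate budget.

TARGET_FP = 0.0004   # 0.04%, i.e. 1:2500
THRESHOLD = 5.0      # SpamAssassin's default required_score

def fp_rate(weights, scale, ham_messages):
    """Fraction of held-out ham scoring at or above the threshold."""
    fps = sum(
        1 for rules in ham_messages
        if scale * sum(weights.get(r, 0.0) for r in rules) >= THRESHOLD
    )
    return fps / len(ham_messages)

def choose_scale(weights, ham_messages, candidates):
    """Largest candidate scale that keeps the FP rate within budget."""
    ok = [s for s in candidates
          if fp_rate(weights, s, ham_messages) <= TARGET_FP]
    return max(ok) if ok else min(candidates)

weights = {"RULE_A": 1.0, "RULE_B": 3.0}
ham = [{"RULE_A"}, set(), {"RULE_A", "RULE_B"}, set()]
best = choose_scale(weights, ham, [1.0, 1.5, 2.0])
```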

Even though the scores seem low, they are actually quite good.  For set 
3 (network+bayes), the results of our cross-validations suggest that 
SpamAssassin will have a false negative rate of approximately 0.9%.  
This seems about right for my own personal e-mail (I do not contribute 
to the corpus that we use to train the classifier).

In the coming weeks, we will have to pay close attention to how spammers 
are able to defeat SpamAssassin.  If we find that the scores really are 
too low, I can quickly generate new ones with wider ranges for a 3.0.1 
release.  Keep in mind that our top priority is precision, not recall.

Henry

>------- Additional Comments From [EMAIL PROTECTED]  2004-09-27 05:22 -------
>The bug shows two principal problems with perceptrons: 
>1.) They are only guaranteed to converge on a local optimum.
>2.) In general, they have no protection against overfitting, meaning that they
>"learn the training data set by heart" and fail to generalize to new cases
>(messages not previously trained on).
>
>Both might have happened in the Bayes-score example.  
>(Also, note that the encoding of the output from the Bayes classifier is
>unnecessarily hard for the perceptron to learn: a single Bayes score value
>(a real number from [0, 1]) would be much easier to learn.)
>
>A real fix for the problem would be not to use perceptrons at all.  Other
>machine learning algorithms (boosting or support vector machines) have much
>better regularization properties, and they are guaranteed to converge on a
>global optimum.
>
>Sure, the perceptron is an improvement over the GA.  But, IMHO, it is still not
>the best way to go.
>





------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
