http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686
------- Additional Comments From [EMAIL PROTECTED]  2008-01-18 02:26 -------
I had some off-list discussion with Fidelis about this...

He suggests using ROCA% rather than the plain error rate as the accuracy measure:

Fidelis Assis writes:
> Justin Mason wrote:
> > Fidelis Assis writes:
> >> Justin Mason wrote:
> >>> Fidelis Assis writes:
> >>>> Justin Mason wrote:
> >>>>> Fidelis Assis writes:
> >> The other day I was in a discussion on the CRM114 list about error-rate
> >> vs. ROCA% and I made an analogy to archers showing why I think it's
> >> possibly better for spam filters. It might be interesting, at least as a
> >> curiosity :-)
> >>
> >> http://sourceforge.net/mailarchive/forum.php?thread_name=200711271356.lARDujYL031322%40spoo.merl.com&forum_name=crm114-general
> > 
> > Ah, that's a very good explanation.  You might have convinced me, I think ;)
> > If I get some time soon, I'll try re-examining those results using
> > 1-ROCA%.
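
For anyone wanting to re-examine results the same way, here is a minimal sketch of computing 1-ROCA% from per-message scores. It assumes ROCA is the area under the ROC curve (so 1-ROCA% is the area above it, as a percentage, lower being better) and that higher scores mean "more spammy". The function name and the pairwise method are illustrative choices, not code from the bug or from CRM114.

```python
def one_minus_roca_percent(ham_scores, spam_scores):
    """Area above the ROC curve as a percentage (lower is better).

    Computed pairwise: the fraction of (ham, spam) pairs where the ham
    message scores higher than the spam message, counting ties as half.
    This is O(n*m), fine for small test corpora.
    """
    if not ham_scores or not spam_scores:
        raise ValueError("need at least one ham and one spam score")
    misordered = 0.0
    for h in ham_scores:
        for s in spam_scores:
            if h > s:
                misordered += 1.0
            elif h == s:
                misordered += 0.5
    return 100.0 * misordered / (len(ham_scores) * len(spam_scores))
```

Unlike a raw error count, this does not depend on where the spam/ham threshold is set, only on how well the scores separate the two classes.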

also suggests changing the inputs to the combiner:

> >>>>> from the EDDC equation is used as P(spam) values and fed into our naive
> >>>>> Bayes combiner, producing a value ranging from 0.0 (nonspam) to 0.5
> >>>>> (unsure) to 1.0 (spam).
> >>>> I don't use probabilities directly, but the ratio
> >>>> 0.59*log10(p(ham)/p(spam)). OSBF probabilities are either very close to
> >>>> 1 or to 0.
> >>> hmm, I may try that.
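
A rough sketch of the ratio Fidelis describes, as opposed to feeding P(spam) straight into the combiner. Only the 0.59 factor and the log10(p(ham)/p(spam)) form come from the quote above; the function name, the clamping constant and the way the value would then be used by the combiner are assumptions for illustration.

```python
import math

def log_ratio_score(p_ham, p_spam, scale=0.59, eps=1e-300):
    """Return scale * log10(p_ham / p_spam).

    By construction, positive values lean towards ham and negative
    values towards spam. The eps clamp is an assumption added here so
    that probabilities of exactly 0 do not raise a math error; the
    quoted text notes OSBF probabilities sit very close to 0 or 1.
    """
    p_ham = max(p_ham, eps)
    p_spam = max(p_spam, eps)
    return scale * math.log10(p_ham / p_spam)
```

Working on the log ratio spreads out values that would otherwise all be squashed against 0 or 1.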

and tried out osbf-lua on my test corpus:

> The filter learns better if the order of the messages is the original, or 
> random, instead of a batch of a class and then a batch of the other. A 
> modified script using random order is attached for your tests.
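
The attached script is not reproduced here, but the idea of training in random order rather than one class batch after the other can be sketched as below. The function name and the (message, label) tuple layout are hypothetical, chosen only for this example.

```python
import random

def interleave_for_training(ham_msgs, spam_msgs, seed=None):
    """Return (message, label) pairs in random order.

    Mixing the two classes avoids presenting a whole batch of ham
    followed by a whole batch of spam to the learner. Pass a seed to
    make a test run repeatable.
    """
    labelled = [(m, "ham") for m in ham_msgs] + [(m, "spam") for m in spam_msgs]
    rng = random.Random(seed)
    rng.shuffle(labelled)
    return labelled
```

Each returned pair can then be handed to whatever train call the filter under test provides.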

he gets much better results:

'I did the tests removing the X-Spam-* headers and I got 0 FP and 12 FN
but from the 12, 9 are exactly the same message: msg 33 in spam bucket.4
with Subject: "Congress Proposes Olympic Boycott" (is this spam?);
another 2 are also the same message: msg 165 in spam bucket.2, with
subject: "Notice of account temporary suspension" (paypal phishing). The
last one is another paypal phishing, but with the same contents: msg 174
in spam bucket.6.

If we don't count the same mistake repeatedly we have 0 FP and 3 FN,
which is still very good considering that the filter was trained with
only 422 msgs, and it reaches its max accuracy after 2-3k.'
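
The "12 FN versus 3 FN" arithmetic above can be checked with a small sketch that counts repeated mistakes once. The bucket names and message numbers are the ones quoted; keying each error by (bucket, message number) is an assumption about how the duplicates were identified.

```python
from collections import Counter

# One entry per reported false negative, keyed by (bucket, message number).
false_negatives = (
    [("spam bucket.4", 33)] * 9     # "Congress Proposes Olympic Boycott"
    + [("spam bucket.2", 165)] * 2  # "Notice of account temporary suspension"
    + [("spam bucket.6", 174)]      # the remaining paypal phishing message
)

counts = Counter(false_negatives)
total_fn = sum(counts.values())  # every miss counted
distinct_fn = len(counts)        # each repeated miss counted once

print(total_fn, distinct_fn)
```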

So the code I've got here is still some way off osbf-lua's accuracy...


