On Mon, Nov 21, 2005 at 12:15:13PM -0800, Justin Mason wrote:
> By the way, it's worth noting that SpamAssassin's bayes-like probabilistic
> classifier (in other words the "BAYES_*" rules), and spambayes are almost
> identical in implementation, so results should be very similar.
>
> However, we include a method for expiring data, to control database
> growth, whereas they do not (afaik); in addition, there have been a few
> differences in tokenizer features, and the odd constant tweaked. But
> generally, it should be very close.
>
> Having said that, we haven't changed the tokenizer or algorithms in quite
> a while, and they may have been innovating in the meantime. ;)

Hmmm.... that could be checked by running the SMO on just the
BAYES_* rules, or by using a fixed weight corresponding to the
probability interval each of them represents. The best option would
be a simple way to get the probability output of the Bayes classifier
within SA for each mail - is that possible?
If the trade-off curve of that simplified model is exactly the same
as that of SpamBayes, there have been no changes. And we could also
see exactly how much of the observed performance is due to the
included Bayes learner, and how much is due to the contribution of
the other rules.
However, the current results suggest that the Bayes learner does
most of the work (although about 200-300 other rules also get
nonzero scores), at least once a sufficient number of mails has been
used for training. So I would not expect any surprising results.
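
As a rough sketch of how that check could look - none of this is SA
code: load_rule_hits() is a stand-in for however the rule hits get
exported, the interval midpoints are read off the default SA 3.x
BAYES_* rule definitions (worth verifying against the actual rules),
and scikit-learn's SVC is used because libsvm trains it via SMO:

  from sklearn.svm import SVC          # libsvm fits this via SMO
  from sklearn.metrics import roc_curve, auc
  from sklearn.model_selection import train_test_split

  # Hypothetical export: binary rule-hit matrix X (mails x rules) as
  # a numpy array, labels y (1 = spam), and a parallel rule-name list.
  X, y, rule_names = load_rule_hits()

  bayes_cols = [i for i, n in enumerate(rule_names)
                if n.startswith("BAYES_")]
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

  # Train once on all rules and once on the BAYES_* rules alone,
  # then compare the resulting trade-off (ROC) curves.
  for label, cols in [("all rules", list(range(X.shape[1]))),
                      ("BAYES_* only", bayes_cols)]:
      clf = SVC(kernel="linear", probability=True)
      clf.fit(X_tr[:, cols], y_tr)
      prob = clf.predict_proba(X_te[:, cols])[:, 1]
      fpr, tpr, _ = roc_curve(y_te, prob)
      print(label, "AUC =", round(auc(fpr, tpr), 4))

  # "Fixed weight" variant: read off an approximate probability from
  # whichever BAYES_* rule fired (they are mutually exclusive), using
  # the midpoint of the interval that rule covers.
  BAYES_MIDPOINT = {
      "BAYES_00": 0.005, "BAYES_05": 0.03, "BAYES_20": 0.125,
      "BAYES_40": 0.30,  "BAYES_50": 0.50, "BAYES_60": 0.70,
      "BAYES_80": 0.875, "BAYES_95": 0.97, "BAYES_99": 0.995,
  }

  def bayes_prob(row):
      for i in bayes_cols:
          if row[i]:
              return BAYES_MIDPOINT[rule_names[i]]
      return 0.5  # no BAYES_* rule fired

  probs = [bayes_prob(row) for row in X_te]
  fpr, tpr, _ = roc_curve(y_te, probs)
  print("fixed-weight BAYES_*", "AUC =", round(auc(fpr, tpr), 4))

If the BAYES_*-only curve matches the SpamBayes curve on the same
mails, the two implementations have not drifted apart.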

IMHO the rules are probably worth more when the Bayes classifier has
been trained on very few or no mails. It should also be noted that
we used the initial SA model to bootstrap large parts of our spam
collection, for which it is quite useful because of its low FP rate.
However, the spammers _do_ get ahead of SpamAssassin ever more
rapidly: I was rather shocked to see its FN rate go from 30% to 70%
over a period of 18 weeks on my own mails, again with the default
rule set.
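
For reference, a minimal sketch of how such a measurement can be
reproduced; the loader and the record format are my assumptions, not
anything SA itself produces:

  from collections import defaultdict

  # Hypothetical export: one (date, is_spam, flagged_as_spam) record
  # per mail, all scored with the same unchanged default rule set.
  mails = load_labeled_mails()

  spam = defaultdict(int)  # spam mails per ISO week
  fn = defaultdict(int)    # spam that slipped through, per ISO week
  for d, is_spam, flagged in mails:
      if is_spam:
          week = d.isocalendar()[:2]  # (year, week number)
          spam[week] += 1
          if not flagged:
              fn[week] += 1

  for week in sorted(spam):
      print(week, "FN rate = {:.0%}".format(fn[week] / spam[week]))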

An incremental feedback mechanism that allows the score set to
evolve could help - even though every SA installation started out
with the same score set, this would allow each installation to shift
its score set in a different direction, making the spammers' job
harder (see the sketch below). Or you could somehow prevent the
default score set from being easily available to automated
downloaders, perhaps with some human-intelligence tests (e.g.
reading digits and characters from bitmaps). Or you could go the way
of BrightMail and distribute changes in the score set to official SA
installations only, weeding out the spammers' installations. Any of
these might improve the situation.
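
To make the first of these ideas concrete, here is a minimal sketch
of a perceptron-style per-installation update; the learning rate,
the clipping bounds and the data format are arbitrary choices of
mine, not anything SA currently does:

  def update_scores(scores, fired_rules, was_spam, flagged_as_spam,
                    lr=0.1):
      """Nudge the scores of the rules that fired on a mail the user
      has just corrected. scores: dict of rule name -> score."""
      if was_spam == flagged_as_spam:
          return  # decision was correct, leave the scores alone
      # False negative (spam got through): raise the fired rules'
      # scores. False positive (ham was flagged): lower them.
      delta = lr if was_spam else -lr
      for rule in fired_rules:
          scores[rule] = min(5.0, max(-5.0, scores[rule] + delta))

Since every installation sees different mail and makes different
mistakes, the score sets would drift apart over time - which is
exactly what makes a single downloaded score set less useful to
spammers.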
Best,
Alex
--
Dr.techn. Alexander K. Seewald
Solutions for the 21st century +43(664)1106886
------------------------------------------------
Information wants to be free;
Information also wants to be expensive (S. Brand)
--------------- alex.seewald.at ----------------