Tod A. Sandman wrote, on 15. feb 2007 16:36:
There is certainly a bug in DSPAM that prevents a sender address
from getting whitelisted, or at least makes it extremely difficult,
once it has been classified as SPAM. I have noticed and verified
this behavior.
The problem appears to be that when correcting a NOTSPAM (innocent)
email that has been mis-classified as SPAM, the "auto-whitelisting"
(or "sender") token counts for that email are not updated correctly.
The NOTSPAM count is incremented as expected, but the SPAM count is
NOT decremented.
I looked at the code and noticed that, for auto-whitelisting, the
NOTSPAM count on the sender token must be 15 times the SPAM count on
that token:
    if (ds_term->key == whitelist_token &&
        ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
        ds_term->s.innocent_hits > CTX->wh_threshold &&
        CTX->classification == DSR_NONE)
    {
        do_whitelist = 1;
    }
I also verified that, even in TOE mode and fully trained, the sender
token counts always get updated: when a message comes in and gets
classified as SPAM, the SPAM count for the sender token gets
incremented; when you correct it (--class=innocent --source=error),
the SPAM count remains the same, while the NOTSPAM count is
incremented.
So I'd have to retrain such a message 15 times to make up for a
single mis-classification (as far as auto-whitelisting is
concerned).
More details:
DSPAM converts the entire "From" string into a token, keeps count of
SPAM and NOTSPAM hits against this token, and bases its
auto-whitelisting on these counts. The counts seem to be updated
regardless of training mode (which seems like a good thing).
Since I started from scratch with DSPAM last August, I have received
11 "software release" notices from a co-worker that have the same
exact "From" string. I have received no other emails with this same
From string. Emails 1,2,3,4,5,7 were all correctly classified as
NOTSPAM. 6,8,9,10,11 were all mis-classified as SPAM and corrected.
For each of the 6 correctly classified emails, the sender token
NOTSPAM count was incremented by 1. For each of the 5 mis-
classified messages, the SPAM count was incremented by 1, and
afterwards the NOTSPAM count was incremented by 1 via re-training.
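The bookkeeping described above can be replayed in a few lines of C. This is only a sketch of the observed behavior, not DSPAM's code: the struct `sender_counts` and the functions `record_delivery`, `retrain_innocent_observed` and `replay_release_notices` are hypothetical names made up for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the sender token's counters (not DSPAM's struct). */
struct sender_counts {
    long spam_hits;
    long innocent_hits;
};

/* A message arrives and is classified: the matching counter goes up by one. */
void record_delivery(struct sender_counts *c, int classified_as_spam)
{
    if (classified_as_spam)
        c->spam_hits++;
    else
        c->innocent_hits++;
}

/* Correction as observed: NOTSPAM count goes up, SPAM count is left alone. */
void retrain_innocent_observed(struct sender_counts *c)
{
    c->innocent_hits++;
}

/* Replay the 11 notices: 1-5 and 7 classified correctly, 6 and 8-11 corrected. */
struct sender_counts replay_release_notices(void)
{
    static const int misclassified[11] = {0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1};
    struct sender_counts c = {0, 0};
    size_t i;

    for (i = 0; i < 11; i++) {
        record_delivery(&c, misclassified[i]);
        if (misclassified[i])
            retrain_innocent_observed(&c);
    }
    return c;
}
```

Calling `replay_release_notices()` ends with 5 SPAM hits and 11 NOTSPAM hits: six from the correctly classified messages plus five from re-training.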
So now a dump looks like this (I've *'ed out the actual address):
    dspam_dump sandmant "[EMAIL PROTECTED] (***********)"
    12884437171547646301 S: 00005 I: 00011 P: 0.2086
Even though my whitelistThreshold is 10, this email will never get
whitelisted unless I get about 70 more correctly classified emails
(AND no more mis-classifications).
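To put a number on that, here is a small sketch that applies only the two count checks from the condition quoted earlier. It assumes the counters are integers, so that the division by 15 truncates; `counts_allow_whitelist` and `extra_innocent_hits_needed` are hypothetical helpers, not DSPAM functions.

```c
/* Mirrors only the two count checks of the quoted condition.
 * Assumption: integer counters, so innocent_hits / 15 truncates. */
int counts_allow_whitelist(long spam_hits, long innocent_hits, long wh_threshold)
{
    return spam_hits <= (innocent_hits / 15) && innocent_hits > wh_threshold;
}

/* Count how many further NOTSPAM hits are needed before the checks pass,
 * assuming the SPAM count does not grow in the meantime. */
long extra_innocent_hits_needed(long spam_hits, long innocent_hits, long wh_threshold)
{
    long extra = 0;

    while (!counts_allow_whitelist(spam_hits, innocent_hits + extra, wh_threshold))
        extra++;
    return extra;
}
```

With the dumped counts, `extra_innocent_hits_needed(5, 11, 10)` comes out at 64 (the NOTSPAM count has to reach 75), which is in the same range as the rough estimate above.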
It seems the fix would be to decrement the SPAM count for the sender
token when retraining instead of, or in addition to, incrementing
the NOTSPAM count.
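As a rough illustration of the "in addition to" variant, the correction step might look like the sketch below. This is not a patch against the DSPAM source: `sender_counts`, `retrain_innocent_proposed` and `counts_allow_whitelist` are hypothetical names, and only the two count checks of the whitelisting condition are modelled.

```c
/* Hypothetical stand-in for the sender token's counters (not DSPAM's struct). */
struct sender_counts {
    long spam_hits;
    long innocent_hits;
};

/* Proposed correction: take back the earlier SPAM hit, then credit NOTSPAM. */
void retrain_innocent_proposed(struct sender_counts *c)
{
    if (c->spam_hits > 0)       /* never let the counter go negative */
        c->spam_hits--;
    c->innocent_hits++;
}

/* The two count checks from the whitelisting condition, nothing else. */
int counts_allow_whitelist(const struct sender_counts *c, long wh_threshold)
{
    return c->spam_hits <= (c->innocent_hits / 15) &&
           c->innocent_hits > wh_threshold;
}
```

Replaying my 11 messages with this correction step would leave the sender token at S: 0, I: 11, which passes both count checks with a threshold of 10.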
I hope that Jonathan reads this and does something about it, though I
doubt that he will; it seems he's doing other things nowadays (getting
rich? New girlfriend?)
We run CVS and I make new rpms for our test and production sites on a
regular basis. According to the CHANGELOG for 040207, nothing has been
modified since 20061210.
--Tonni
--
Tony Earnshaw
Email: tonni at hetnet dot nl