There is certainly a bug in DSPAM that prevents a sender address from
being whitelisted, or at least makes it extremely difficult, once a
message from that address has been classified as SPAM. I have
observed and verified this behavior.
The problem appears to be that when correcting a NOTSPAM (innocent)
email that has been mis-classified as SPAM, the "auto-whitelisting"
(or "sender") token counts for that email are not updated correctly.
The NOTSPAM count is incremented as expected, but the SPAM count is
NOT decremented.
I looked at the code and noticed that, for auto-whitelisting, the
NOTSPAM count on the sender token must be at least 15 times the SPAM
count on that token:
    if (ds_term->key == whitelist_token &&
        ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
        ds_term->s.innocent_hits > CTX->wh_threshold &&
        CTX->classification == DSR_NONE)
    {
        do_whitelist = 1;
    }
I also verified that, even in TOE mode and fully trained, the sender
token counts always get updated: when a message comes in and gets
classified as SPAM, the SPAM count for the sender token gets
incremented; when you correct it (--class=innocent --source=error),
the SPAM count remains the same, while the NOTSPAM count is
incremented.
So I'd have to retrain such a message 15 times to make up for a
single mis-classification (as far as auto-whitelisting is
concerned).
More details:
DSPAM converts the entire "From" string into a token, keeps count of
SPAM and NOTSPAM hits against this token, and bases its
auto-whitelisting on these counts. The counts seem to be updated
regardless of training mode (which seems like a good thing).
Since I started from scratch with DSPAM last August, I have received
11 "software release" notices from a co-worker that have the same
exact "From" string. I have received no other emails with this same
"From" string. Emails 1, 2, 3, 4, 5 and 7 were all correctly
classified as NOTSPAM. Emails 6, 8, 9, 10 and 11 were all
mis-classified as SPAM and corrected.
For each of the 6 correctly classified emails, the sender token
NOTSPAM count was incremented by 1. For each of the 5 mis-
classified messages, the SPAM count was incremented by 1, and
afterwards the NOTSPAM count was incremented by 1 via re-training.
So now a dump looks like this (I've *'ed out the actual address):
    dspam_dump sandmant "[EMAIL PROTECTED] (***********)"
    12884437171547646301 S: 00005 I: 00011 P: 0.2086
Even though my whitelistThreshold is 10, this email will never get
whitelisted unless I get about 70 more correctly classified emails
(AND no more mis-classifications).
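To put a number on that, here is a minimal sketch of the test quoted
above as a stand-alone C function. The function names are made up for
illustration and are not part of DSPAM, and I am assuming the hit
counters are integers, so the division by 15 truncates. The token-key
and classification checks from the original condition are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-alone version of the ratio and threshold checks
 * from the whitelist condition. Integer division is assumed. */
static bool would_whitelist(long spam_hits, long innocent_hits,
                            long wh_threshold)
{
    return spam_hits <= innocent_hits / 15 &&
           innocent_hits > wh_threshold;
}

/* Smallest innocent count that passes both checks for a given spam
 * count, assuming the spam count does not change. */
static long innocent_hits_needed(long spam_hits, long wh_threshold)
{
    long by_ratio = spam_hits * 15;        /* spam <= innocent / 15 */
    long by_threshold = wh_threshold + 1;  /* innocent > threshold  */
    return by_ratio > by_threshold ? by_ratio : by_threshold;
}
```

With the counts from the dump (S: 5, I: 11) and a threshold of 10,
would_whitelist(5, 11, 10) is false and innocent_hits_needed(5, 10)
is 75, i.e. 64 more NOTSPAM hits under these assumptions, which is
the same ballpark as my rough figure above.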
It seems the fix would be to decrement the SPAM count for the sender
token when retraining instead of, or in addition to, incrementing
the NOTSPAM count.
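As a rough sketch of what I mean (this is not DSPAM's actual code;
the struct and function names below are hypothetical), the two
retraining behaviors for the sender token could be compared like
this. I have assumed the "in addition to" variant and a guard so the
SPAM count never goes below zero.

```c
/* Hypothetical stand-in for the per-token counters. */
struct sender_stat {
    long spam_hits;
    long innocent_hits;
};

/* Message arrives and is classified. */
static void classified_spam(struct sender_stat *s)     { s->spam_hits++; }
static void classified_innocent(struct sender_stat *s) { s->innocent_hits++; }

/* Behavior I observe when correcting a false positive:
 * only the NOTSPAM count moves. */
static void retrain_innocent_observed(struct sender_stat *s)
{
    s->innocent_hits++;
}

/* Suggested behavior: also undo the earlier SPAM hit. */
static void retrain_innocent_proposed(struct sender_stat *s)
{
    if (s->spam_hits > 0)
        s->spam_hits--;
    s->innocent_hits++;
}
```

Replaying my 11 emails (6 classified correctly, 5 mis-classified and
corrected) gives S: 5, I: 11 with the observed behavior, matching the
dump, and S: 0, I: 11 with the proposed one, which would satisfy both
the 15:1 ratio and a whitelistThreshold of 10.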
Tod Sandman
Sr. Systems Administrator
Middleware Development & Integration
Rice University