Tod A. Sandman wrote, on 15. feb 2007 16:36:

There is certainly a bug in DSPAM that prevents a sender address
from getting whitelisted, or at least makes it extremely difficult,
once mail from it has been classified as SPAM.  I have noticed and
verified this behavior.

The problem appears to be that when correcting a NOTSPAM (innocent)
email that has been mis-classified as SPAM, the "auto-whitelisting"
(or "sender") token counts for that email are not updated correctly.
The NOTSPAM count is incremented as expected, but the SPAM count is
NOT decremented.

I looked at the code and noticed that, for auto-whitelisting, the
NOTSPAM count on the sender token must be 15 times the SPAM count on
that token:

      if (ds_term->key == whitelist_token              &&
          ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
          ds_term->s.innocent_hits > CTX->wh_threshold &&
          CTX->classification == DSR_NONE)
      {
        do_whitelist = 1;
      }

I also verified that, even in TOE mode and fully trained, the sender
token counts always get updated:  when a message comes in and gets
classified as SPAM, the SPAM count for the sender token gets
incremented; when you correct it (--class=innocent --source=error),
the SPAM count remains the same, while the NOTSPAM count is
incremented.

So I'd have to retrain such a message 15 times to make up for a
single mis-classification (as far as auto-whitelisting is
concerned).


More details:

DSPAM converts the entire "From" string into a token, keeps count of
SPAM and NOTSPAM hits against this token, and bases its
auto-whitelisting on these counts.  The counts seem to be updated
regardless of training mode (which seems like a good thing).

Since I started from scratch with DSPAM last August, I have received
11 "software release" notices from a co-worker that have the same
exact "From" string.  I have received no other emails with this same
From string.  Emails 1,2,3,4,5,7 were all correctly classified as
NOTSPAM.  6,8,9,10,11 were all mis-classified as SPAM and corrected.

For each of the 6 correctly classified emails, the sender token
NOTSPAM count was incremented by 1.   For each of the 5 mis-
classified messages, the SPAM count was incremented by 1, and
afterwards the NOTSPAM count was incremented by 1 via re-training.

So now a dump looks like this (I've *'ed out the actual address):

  dspam_dump sandmant "[EMAIL PROTECTED] (***********)"
  12884437171547646301 S: 00005  I: 00011  P: 0.2086

Even though my whitelistThreshold is 10, this email will never get
whitelisted unless I get about 70 more correctly classified emails
(AND no more mis-classifications).
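
To make the arithmetic concrete, here is a minimal sketch of just the two
count-based tests from the condition quoted earlier.  The function names
are hypothetical (they are not DSPAM functions), the token-key and
classification checks are left out, and unsigned long is an assumed type
for the counters.

```c
/* Hypothetical helper mirroring only the two count-based tests in the
 * quoted condition; the key and classification checks are omitted. */
static int counts_allow_whitelist(unsigned long spam_hits,
                                  unsigned long innocent_hits,
                                  unsigned long wh_threshold)
{
    /* Integer division, as in the quoted code: 74 / 15 == 4. */
    return spam_hits <= (innocent_hits / 15) &&
           innocent_hits > wh_threshold;
}

/* Smallest innocent count that satisfies both tests for a given
 * spam count: max(15 * spam_hits, wh_threshold + 1). */
static unsigned long innocent_hits_needed(unsigned long spam_hits,
                                          unsigned long wh_threshold)
{
    unsigned long by_ratio = 15UL * spam_hits;
    unsigned long by_threshold = wh_threshold + 1UL;
    return by_ratio > by_threshold ? by_ratio : by_threshold;
}
```

With the counts from the dump (S: 5, I: 11) and a threshold of 10,
innocent_hits_needed(5, 10) works out to 75, i.e. 64 more innocent hits
with no further SPAM hits, which is in line with the "about 70" estimate.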

It seems the fix would be to decrement the SPAM count for the sender
token when retraining instead of, or in addition to, incrementing
the NOTSPAM count.
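
A rough sketch of what that might look like, assuming the "in addition
to" variant.  The struct and both function names are made up for
illustration and are not taken from the DSPAM source; only the field
names follow the snippet quoted earlier.

```c
/* Hypothetical per-token counters; field names follow the quoted snippet. */
struct sender_stat {
    unsigned long spam_hits;
    unsigned long innocent_hits;
};

/* Behavior as observed: a correction only adds an innocent hit. */
static void correct_to_innocent_observed(struct sender_stat *s)
{
    s->innocent_hits++;
}

/* Proposed: also undo the spam hit recorded at classification time. */
static void correct_to_innocent_proposed(struct sender_stat *s)
{
    if (s->spam_hits > 0)   /* guard against unsigned underflow */
        s->spam_hits--;
    s->innocent_hits++;
}
```

Replaying the 11-email history above (6 correct classifications, 5
mis-classifications each followed by a correction) gives S: 5, I: 11
with the observed behavior, matching the dump, and S: 0, I: 11 with the
proposed one.  The latter would satisfy the quoted condition with a
threshold of 10, since 0 <= 11 / 15 and 11 > 10.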

I hope that Jonathan reads this and does something about it, though I doubt that he will; it seems like he's doing other things nowadays (getting rich? New girlfriend?)

We run CVS and I make new rpms for our test and production sites on a regular basis. According to the CHANGELOG for 040207, nothing has been modified since 20061210.

--Tonni

--
Tony Earnshaw
Email: tonni at hetnet dot nl
