Feature Requests item #3142744, was opened at 2010-12-23 12:31
Message generated for change (Settings changed) made by sbajic
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1126468&aid=3142744&group_id=250683

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Priority: 5
Private: No
Submitted By: Jens Finkhaeuser (unwesen)
>Assigned to: Stevan Bajic (sbajic)
Summary: Undo whitelisting suggestion

Initial Comment:
As I mentioned in a different issue (that I just found while looking for this), 
it seems very hard to train dspam to *not* whitelist some sender. This all 
seems to boil down to this code here:

    if (CTX->flags & DSF_WHITELIST) {
      if (ds_term->key == whitelist_token              &&
          ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
          ds_term->s.innocent_hits > CTX->wh_threshold &&
          CTX->classification == DSR_NONE)
      {
        do_whitelist = 1;
      }
    }

Ca. line 930 in libdspam.c.

The whitelist_token appears to be calculated from the sender address (or from: 
line); so I understand the logic that if a sender is found, and it's got 15x as 
many innocent hits as spam hits, then whitelist the message (leaving out a few 
details here).
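
The rule in the snippet above can be isolated into a small, self-contained sketch. The struct and function names here are hypothetical stand-ins (they are not DSPAM's API); the two boolean parameters stand in for the `ds_term->key == whitelist_token` and `CTX->classification == DSR_NONE` checks.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-token counters (ds_term->s in the snippet). */
struct token_stats {
    long spam_hits;
    long innocent_hits;
};

/*
 * Mirrors the condition quoted from libdspam.c. Not DSPAM's real function:
 * is_whitelist_token and unclassified replace the key and classification
 * checks so that the arithmetic can be tried in isolation.
 */
bool should_whitelist(const struct token_stats *s, long wh_threshold,
                      bool is_whitelist_token, bool unclassified)
{
    return is_whitelist_token &&
           s->spam_hits <= s->innocent_hits / 15 &&  /* integer division, as in the snippet */
           s->innocent_hits > wh_threshold &&
           unclassified;
}
```

Assuming that retraining as spam only adds to spam_hits, a sender with 100 innocent hits stays whitelisted through 6 spam hits (100 / 15 is 6 in integer arithmetic) and only drops out at the 7th, which is the usability problem described here.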

I think that logic works well enough for deciding that a sender can be presumed 
innocent, but it doesn't work very well for suggesting that the sender might in 
fact not be a good candidate for whitelisting. That logic seems to be in there 
because the whitelist_token's spam probability is hardcoded to 0.5 (in 
_ds_calc_stats). Wouldn't it make much more sense to calculate its probability 
properly, and use wh_threshold as a probability threshold, i.e. if the spam 
probability is below 0.3 or whatever, then whitelist it?

That way you can use the same probability calculation as for other terms and 
therefore train dspam, but still treat the whitelist token as special in that 
if it is trained to be ok, then the rest of the tokens get disregarded because 
the message is whitelisted.

I've attached a patch that compiles, but is otherwise untested - mostly because 
I have no idea of what ramifications the change might have outside the code I 
touched. Also, it changes the meaning and format of the wh_token config 
variable, which is most likely *not* what you want. But it'll convey what I 
mean better than writing more text :)

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 13:43

Message:
Using the probability is a double-edged sword. I find the suggestion from
my last message simpler and easier to understand. So you would have nothing
against something like that being implemented in DSPAM? If so, then it's
time to ask the DSPAM mailing list whether others would have a problem with
the whitelisting being changed. Would you mind taking on that task?

----------------------------------------------------------------------

Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-14 13:33

Message:
Well, yes, I pretty much want to turn off whitelisting for certain senders.
More specifically, I'd really prefer it if the regular retraining mechanism
(forwarding/bouncing emails, in my case) could be used for that.

The last suggestion you've posted seems like 99% of the solution to me.
The reason I would've chosen to compare some probability value rather than
resetting the innocent hit count to 0 is mostly because it hardcodes the
"one spam hit == no more whitelist" behaviour, whereas "N spam hits == no
more whitelist" would leave the users/administrators a bit more choice.

I mean, it's entirely possible to introduce a new config variable for that
N. I just figured using the token probability leverages an existing
mechanism, rather than adding a new one. I don't mind either way, really.
Your last suggestion would work just fine for me.
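
A minimal sketch of the "N spam hits == no more whitelist" variant mentioned above. The parameter `max_spam_hits` is a hypothetical new setting for that N; no such option exists in DSPAM as far as this thread shows, and the struct and function names are likewise illustrative only.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-token counters. */
struct token_stats {
    long spam_hits;
    long innocent_hits;
};

/*
 * Whitelist only while the sender has fewer than max_spam_hits spam hits
 * and more innocent hits than the whitelist threshold.
 * max_spam_hits is an assumed, not an existing, configuration value.
 */
bool should_whitelist_n(const struct token_stats *s, long wh_threshold,
                        long max_spam_hits)
{
    return s->spam_hits < max_spam_hits &&
           s->innocent_hits > wh_threshold;
}
```

With max_spam_hits set to 1 this gives the "one spam hit == no more whitelist" behaviour; larger values leave administrators some tolerance for accidental spam reports.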

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 13:12

Message:
btw: one possible way to change the whitelisting functionality and avoid
that 15 x could be:

* every time a whitelist token gets a spam hit, automatically reset its
innocent hit count to 0
* require for whitelisting that innocent hits - spam hits > whitelist
threshold
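
The two rules above can be sketched as follows. This is only an illustration under stated assumptions: the struct and function names are hypothetical, and retraining is modelled as a plain spam hit on the whitelist token.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-token counters. */
struct token_stats {
    long spam_hits;
    long innocent_hits;
};

/* Rule 1: a spam hit on the whitelist token resets its innocent hits to 0. */
void record_spam_hit(struct token_stats *s)
{
    s->spam_hits++;
    s->innocent_hits = 0;
}

void record_innocent_hit(struct token_stats *s)
{
    s->innocent_hits++;
}

/* Rule 2: whitelist only if innocent hits - spam hits > whitelist threshold. */
bool should_whitelist_reset(const struct token_stats *s, long wh_threshold)
{
    return (s->innocent_hits - s->spam_hits) > wh_threshold;
}
```

With a threshold of 5, a sender with 10 innocent hits is whitelisted, a single spam hit cancels that immediately, and 7 further innocent hits are needed before whitelisting resumes (7 - 1 > 5).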

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 12:56

Message:
> When I *don't* want a sender to be whitelisted, the
> current code fails me. It seems impossible to retrain
> a whitelisted message often enough for the whitelisting
> to be cancelled out.
> 
This is a tricky issue:
1) use whitelisting
2) don't use whitelisting

If you do 1 then:
1) sender (the whole from line) is whitelisted
2) sender (the whole from line) is not whitelisted
3) sender (just the email/domain) is blocklisted

1 and 2 are done automatically, based on the value you have specified as
the whitelist threshold
3 is done by the end user (adding the domain or the whole sender email
address to the blocklist file)

I don't understand 100% what you want? So far I see two things that you
maybe want (assuming you want whitelisting):
1) turn off whitelisting for certain senders
2) make whitelisting more draconian if the sender is sending you spam
mails


> The 15x in this case is plainly wrong:
> cancelling whitelisting should ideally be by retraining
> *once*; everything else makes for really bad usability.
> 
Well... I am with you. The huge problem is that most users never train, or
often make stupid errors. So this 15 x helps them to still get stuff
whitelisted even if they mark something as spam by accident. I personally
would remove that 15 x. It's IMHO useless. If a sender is already
whitelisted and a user is so stupid as to mark/train that message as spam,
then he should not be rewarded by DSPAM.

I think that 15 x is there because of this scenario: Assume one starts
fresh with DSPAM. Then a message comes from "Test User
<test.u...@example.com>". That message gets classified as spam. Now
assume this happens 20 times. And now the end user realises that he wants
"Test User <test.u...@example.com>" to be in his whitelist. So he starts
to reclassify new messages as innocent. Assume he has a whitelist threshold
of 5. Assume the next 5 messages from "Test User <test.u...@example.com>"
are classified as spam but the end user goes on and reclassifies them as
ham. So at the end you have:

token: From*Test User <test.u...@example.com>
spam hits: 20
innocent hits: 5

Without that 15 x that sender would be whitelisted. A sender that has sent
4 times more spam messages than ham messages. With that 15 x in place the
end user would need way more retraining/reclassifying to correct his errors
from the past. I think this was the original idea behind that 15 x.
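
The arithmetic of this scenario can be checked with a small sketch. The function names are hypothetical. The bounds follow from the condition in the quoted snippet: for non-negative integers, spam_hits <= innocent_hits / 15 holds exactly when innocent_hits >= 15 * spam_hits, and the strict `>` against the threshold means at least wh_threshold + 1 innocent hits.

```c
/*
 * Smallest innocent hit count that satisfies the current rule:
 *   spam_hits <= innocent_hits / 15  and  innocent_hits > wh_threshold
 */
long min_innocent_hits_with_ratio(long spam_hits, long wh_threshold)
{
    long by_ratio = 15 * spam_hits;
    long by_threshold = wh_threshold + 1;
    return by_ratio > by_threshold ? by_ratio : by_threshold;
}

/* Same bound if the 15 x check were simply dropped. */
long min_innocent_hits_without_ratio(long wh_threshold)
{
    return wh_threshold + 1;
}
```

For 20 spam hits and a threshold of 5, the current rule needs 300 innocent hits, so the user in the scenario is still 295 reclassifications away. Without the 15 x, 6 innocent hits would do, i.e. one more than the 5 already recorded, because the comparison is a strict `>`.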

----------------------------------------------------------------------

Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-14 12:33

Message:
Alright, I think I understand why my patch wouldn't work well.

The problem remains as far as I am concerned: I'm happy enough with the
whitelist threshold I've set, but only for when I want a sender
whitelisted. The whitelisting kicks in nice and early, and doesn't require
me to mark stuff as innocent all the time. That's good.

When I *don't* want a sender to be whitelisted, the current code fails me.
It seems impossible to retrain a whitelisted message often enough for the
whitelisting to be cancelled out. The 15x in this case is plainly wrong:
cancelling whitelisting should ideally be by retraining *once*; everything
else makes for really bad usability.

I'm not sure if I understood the code well enough; what I wanted to
achieve with the patch was to base the whitelisting decision off of the
spam probability of the from-token. That way I thought you'd use the same
training mechanism as for other tokens for whitelisting, and retraining
should work equally well.

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 12:00

Message:
> The whitelist_token appears to be calculated from the
> sender address (or from: line);
>
Correct.


> so I understand the logic that if a sender is found,
> and it's got 15x as many innocent hits as spam hits,
> then whitelist the message (leaving out a few details
> here).
>
Here you are wrong. If you use the whitelisting feature in DSPAM then you
can specify the whitelist threshold either in dspam.conf or with
global/user preferences. So the computation is:
1) if whitelisting is enabled then
1.1) if the token is a whitelist token AND the whitelist token has at least
15 times as many innocent hits as spam hits AND the whitelist token's
innocent hit count is bigger than the whitelist threshold AND the message
is not yet classified THEN do whitelist the sender

Your patch would completely mess up the whitelisting. The reason why your
patch will not work in production is that everyone uses different
tokenizers and they all lead to different results. So assume you got this
From line:
From: Test User <t...@example.com>

The whitelisting code does not process that line with the tokenizer but
builds a hard-coded whitelist token:
From*Test User <t...@example.com>

And it assigns a neutral score to it (aka 0.5).

Now assume you change the whitelist threshold to be a probabilistic value
of 0.3 and assume you just got one innocent hit and no spam hit for
"From*Test User <t...@example.com>" then the computation will quickly
produce either 0.0 or 1.0. Just from one single token. And assume a user
has completely messed up his token data and has a lot of spam tokens but
almost no innocent tokens or the other way around. That as well would
completely mess up the whitelist computation. Do you understand what I
mean?
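
Both effects can be illustrated with a simplified Graham-style estimate. This is an assumption made for illustration only: it is not DSPAM's actual computation in _ds_calc_stats, and the function name and parameters are hypothetical.

```c
/*
 * Simplified Graham-style spam probability for one token (an assumption,
 * not DSPAM's formula). total_spam and total_innocent are the user's
 * overall message counts; a token with no usable data stays neutral at 0.5.
 */
double spam_probability(long spam_hits, long innocent_hits,
                        long total_spam, long total_innocent)
{
    double spam_rate = total_spam > 0 ? (double)spam_hits / total_spam : 0.0;
    double innocent_rate =
        total_innocent > 0 ? (double)innocent_hits / total_innocent : 0.0;

    if (spam_rate + innocent_rate == 0.0)
        return 0.5;
    return spam_rate / (spam_rate + innocent_rate);
}
```

Under this estimate a token with one innocent hit and no spam hits already scores 0.0, and one with a single spam hit scores 1.0. A token with 1 spam hit and 3 innocent hits scores 0.25 when the totals are balanced at 100 each, but above 0.9 when the user has 10 spam and 1000 innocent messages, and below 0.01 with the totals reversed, so a fixed cut-off such as 0.3 would behave very differently from one user to the next.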

Besides that, it's hard to explain to someone that he now needs to enter
0.2 for the whitelist threshold. I can already see the mailing list flooded
with questions about what a good whitelist threshold is. The current
implementation, by contrast, lets you say: if you set 100, then you need to
receive more than 100 innocent messages (or train them as innocent) from
"Test User <t...@example.com>" to have "Test User <t...@example.com>"
whitelisted AND you need at least 15 times as many innocent hits as spam
hits for "Test User <t...@example.com>" to be whitelisted.

While looking at the code I think that one enhancement that could be done
to the whitelisting code would be to normalize the innocent hit counter by
subtracting the spam hits and then compare against the whitelist threshold.
Something like this:
if (CTX->flags & DSF_WHITELIST) {
  if (ds_term->key == whitelist_token &&
      ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
      (ds_term->s.innocent_hits - ds_term->s.spam_hits) > CTX->wh_threshold &&
      CTX->classification == DSR_NONE)
  {
    do_whitelist = 1;
  }
}


> I think that logic works well enough for deciding that
> a sender can be presumed innocent, but it doesn't work
> very well for suggesting that the sender might in fact
> not be a good candidate for whitelisting.
>
So you have a From line that has 15 times more innocent hits than spam hits
and an innocent hit count bigger than your specified whitelist threshold,
and you DO NOT consider it worth whitelisting? Have you thought about
increasing the whitelist threshold or about blocklisting that
sender/domain?

IMHO a user and/or domain dependent whitelisting threshold would be a good
way to work around such issues. Currently DSPAM does not have that (it only
has one single whitelist threshold for every sender). Senders are way too
easy to forge, and using the whole From header line is IMHO a good
trade-off.

----------------------------------------------------------------------

Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-14 11:11

Message:
Did anyone have a chance to look at this?

----------------------------------------------------------------------

