Feature Requests item #3142744, was opened at 2010-12-23 11:31
Message generated for change (Comment added) made by unwesen
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=1126468&aid=3142744&group_id=250683

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Priority: 5
Private: No
Submitted By: Jens Finkhaeuser (unwesen)
Assigned to: Stevan Bajic (sbajic)
Summary: Undo whitelisting suggestion

Initial Comment:
As I mentioned in a different issue (that I just found while looking for this), it seems very hard to train dspam to *not* whitelist some sender. This all seems to boil down to this code here:

    if (CTX->flags & DSF_WHITELIST) {
      if (ds_term->key == whitelist_token &&
          ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
          ds_term->s.innocent_hits > CTX->wh_threshold &&
          CTX->classification == DSR_NONE)
      {
        do_whitelist = 1;
      }
    }

Ca. line 930 in libdspam.c.

The whitelist_token appears to be calculated from the sender address (or from: line); so I understand the logic that if a sender is found, and it's got 15x as many innocent hits as spam hits, then whitelist the message (leaving out a few details here).

I think that logic works well enough for deciding that a sender can be presumed innocent, but it doesn't work very well for suggesting that the sender might in fact not be a good candidate for whitelisting.

That logic seems to be in there because the whitelist_token's spam probability is hardcoded to 0.5 (in _ds_calc_stats). Wouldn't it make much more sense to calculate its probability properly, and use wh_threshold as a probability threshold, i.e. if the spam probability is below 0.3 or whatever, then whitelist it? That way you can use the same probability calculation as for other terms and therefore train dspam, but still treat the whitelist token as special in that if it is trained to be ok, then the rest of the tokens get disregarded because the message is whitelisted.

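To make the retraining cost concrete, here is a minimal standalone sketch of the hit-count part of the condition quoted above. The struct `token_stat` and both function names are hypothetical stand-ins, not libdspam API; the counters are assumed to be of type `long`, and each spam retraining is assumed to do nothing except increment `spam_hits`.

```c
#include <assert.h>

/* Hypothetical stand-in for the counters kept on the whitelist token. */
struct token_stat {
    long spam_hits;
    long innocent_hits;
};

/* Mirrors only the hit-count part of the quoted condition; the flag,
 * key and classification checks are left out. The division is integer
 * division, as in the quoted code. */
int would_whitelist(const struct token_stat *s, long wh_threshold)
{
    return s->spam_hits <= (s->innocent_hits / 15) &&
           s->innocent_hits > wh_threshold;
}

/* Counts how many spam retrainings it takes before the token stops
 * qualifying, under the assumption that a retraining only increments
 * spam_hits and leaves innocent_hits untouched. */
long spam_hits_to_cancel(struct token_stat s, long wh_threshold)
{
    long n = 0;

    assert(s.spam_hits >= 0 && s.innocent_hits >= 0);
    while (would_whitelist(&s, wh_threshold)) {
        s.spam_hits++;
        n++;
    }
    return n;
}
```

Under these assumptions, a sender with 30 innocent hits and a threshold of 10 needs 3 spam retrainings before whitelisting stops, and a sender with 300 innocent hits needs 21, so the cost of undoing a whitelisting grows with how long the sender has been trusted.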
I've attached a patch that compiles, but is otherwise untested - mostly because I have no idea of what ramifications the change might have outside the code I touched. Also, it changes the meaning and format of the wh_token config variable, which is most likely *not* what you want. But it'll convey what I mean better than writing more text :)

----------------------------------------------------------------------

>Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-16 11:44

Message:
Oh, the 100% is just what I said, I'd prefer a "N spam hits == no more whitelist" solution.

I will send the email.

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-16 11:00

Message:
I think -user must be there because you want to change the way whitelisting works. Adding -devel would not hurt.

One other thing interests me: You write about 99%. Might I ask you what the other 1% would be? What solution would be 100% in your eyes?

----------------------------------------------------------------------

Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-16 10:56

Message:
Sure. Are you thinking of -devel or -user?

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 12:43

Message:
Using the probability is a double-edged sword. I find the suggestion from my last message easier to understand and simpler. So you would have nothing against something like that being implemented in DSPAM? If so, then it's time to ask the DSPAM mailing list whether others would have a problem if the whitelisting were changed. Would you mind taking that task?

----------------------------------------------------------------------

Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-14 12:33

Message:
Well, yes, I pretty much want to turn off whitelisting for certain senders.
More specifically, I'd really prefer it if the regular retraining mechanism (forwarding/bouncing emails, in my case) could be used for that.

The last suggestion you've posted seems like 99% of the solution to me. The reason I would've chosen to compare some probability value rather than resetting the innocent hit count to 0 is mostly because it hardcodes the "one spam hit == no more whitelist" behaviour, whereas "N spam hits == no more whitelist" would leave the users/administrators a bit more choice.

I mean, it's entirely possible to introduce a new config variable for that N. I just figured using the token probability leverages an existing mechanism, rather than adding a new one.

I don't mind either way, really. Your last suggestion would work just fine for me.

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 12:12

Message:
btw: one possible way to change the whitelisting functionality and avoid that 15 x could be:
* every time a whitelist token gets one spam hit, then automatically reset the innocent hits (setting them to 0)
* require for whitelisting that innocent hits - spam hits > whitelist threshold

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 11:56

Message:
> When I *don't* want a sender to be whitelisted, the
> current code fails me. It seems impossible to retrain
> a whitelisted message often enough for the whitelisting
> to be cancelled out.
>
This is a tricky issue:
1) use whitelisting
2) don't use whitelisting

If you do 1 then:
1) sender (the whole from line) is whitelisted
2) sender (the whole from line) is not whitelisted
3) sender (just the email/domain) is blocklisted

1 and 2 are done automatically with the values you have specified as the whitelist threshold.
3 is done by the end user (adding the domain or the whole sender email address to the blocklist file).

I don't understand 100% what you want?
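The two-rule suggestion from the 2011-05-14 12:12 comment can be sketched in a few lines. This is only an illustration under stated assumptions: `token_stat` and the three function names are hypothetical and do not exist in libdspam, and the counters are assumed to be of type `long`.

```c
/* Hypothetical stand-in for the counters kept on the whitelist token. */
struct token_stat {
    long spam_hits;
    long innocent_hits;
};

/* Rule 1 of the suggestion: a spam hit on the whitelist token
 * resets its innocent hits to 0. */
void record_spam_hit(struct token_stat *s)
{
    s->spam_hits++;
    s->innocent_hits = 0;
}

/* An innocent hit is counted as usual. */
void record_innocent_hit(struct token_stat *s)
{
    s->innocent_hits++;
}

/* Rule 2 of the suggestion: whitelist only if
 * innocent hits - spam hits > whitelist threshold. */
int would_whitelist_with_reset(const struct token_stat *s, long wh_threshold)
{
    return (s->innocent_hits - s->spam_hits) > wh_threshold;
}
```

With a threshold of 5, a sender with 10 innocent hits and no spam hits qualifies. A single spam hit leaves 0 innocent hits and 1 spam hit, so the sender no longer qualifies, and it takes 7 further innocent hits (7 - 1 > 5) to qualify again.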
So far I see two things that you maybe want (assuming you want whitelisting):
1) turn off whitelisting for certain senders
2) make whitelisting more draconian if the sender is sending you spam mails

> The 15x in this case is plainly wrong:
> cancelling whitelisting should ideally be by retraining
> *once*; everything else makes for really bad usability.
>
Well... I am with you. The huge problem is that most users never train or often make stupid errors. So this 15 x is helping them to still get stuff whitelisted even if they mark something as spam by accident. I personally would remove that 15 x. It's IMHO useless. If a sender is already whitelisted and a user is so stupid as to mark/train that message as spam then he should not be rewarded by DSPAM.

I think that 15 x is there because of this scenario here:
Assume one starts fresh with DSPAM. Then a message comes from "Test User <test.u...@example.com>". That message then gets classified as spam. Now assume this happens 20 times. And now the end user realises that he wants "Test User <test.u...@example.com>" to be in his whitelist. So he starts to reclassify new messages as innocent. Assume he has a whitelist threshold of 5. Assume the next 5 messages from "Test User <test.u...@example.com>" are classified as spam but the end user goes on and reclassifies them as ham. So at the end you have:

token: From*Test User <test.u...@example.com>
spam hits: 20
innocent hits: 5

Without that 15 x that sender would be whitelisted. A sender that has sent 4 times more spam messages than ham messages. With that 15 x in place the end user would need way more retraining/reclassifying to correct his errors from the past. I think this was the original idea behind that 15 x.

----------------------------------------------------------------------

Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-14 11:33

Message:
Alright, I think I understand why my patch wouldn't work well.
The problem remains as far as I am concerned: I'm happy enough with the whitelist threshold I've set, but only for when I want a sender whitelisted. The whitelisting kicks in nice and early, and doesn't require me to mark stuff as innocent all the time. That's good.

When I *don't* want a sender to be whitelisted, the current code fails me. It seems impossible to retrain a whitelisted message often enough for the whitelisting to be cancelled out. The 15x in this case is plainly wrong: cancelling whitelisting should ideally be by retraining *once*; everything else makes for really bad usability.

I'm not sure if I understood the code well enough; what I wanted to achieve with the patch was to base the whitelisting decision off of the spam probability of the from-token. That way I thought you'd use the same training mechanism as for other tokens for whitelisting, and retraining should work equally well.

----------------------------------------------------------------------

Comment By: Stevan Bajic (sbajic)
Date: 2011-05-14 11:00

Message:
> The whitelist_token appears to be calculated from the
> sender address (or from: line);
>
Correct.

> so I understand the logic that if a sender is found,
> and it's got 15x as many innocent hits as spam hits,
> then whitelist the message (leaving out a few details
> here).
>
Here you are wrong. If you use the whitelisting feature in DSPAM then you can specify the whitelist threshold either in dspam.conf or with global/user preferences. So the computation is:
1) if whitelisting is enabled then
1.1) if the token is a whitelist token
     AND the whitelist token has at least 15 times as many innocent hits as spam hits
     AND the whitelist token innocent hit count is bigger than the whitelist threshold
     AND the message is not yet classified
     THEN do whitelist the sender

Your patch would completely mess up the whitelisting.
The reason why your patch will not work in production is that everyone uses different tokenizers and they all lead to different results. So assume you got this from line:

From: Test User <t...@example.com>

The whitelisting code would now not process that line with the tokenizer but make a hard coded whitelist token:

From*Test User <t...@example.com>

And it would assign a neutral score for it (aka 0.5).

Now assume you change the whitelist threshold to be a probabilistic value of 0.3 and assume you just got one innocent hit and no spam hit for "From*Test User <t...@example.com>" then the computation will quickly produce either 0.0 or 1.0. Just from one single token. And assume a user has completely messed up his token data and has a lot of spam tokens but almost no innocent tokens or the other way around. That as well would completely mess up the whitelist computation. Do you understand what I mean?

Besides that, it's hard to explain to someone that he now needs to enter 0.2 for the whitelist threshold. I already see the mailing list flooded with questions about what a good whitelist threshold is. Whereas the current implementation allows one to say: if you add 100 then you need to get an innocent message at least 100 times (or you need to train them as innocent) from "Test User <t...@example.com>" to have "Test User <t...@example.com>" whitelisted AND you need at least 15 times more innocent hits than spam hits for "Test User <t...@example.com>" to be whitelisted.

While looking at the code I think that one enhancement that could be done to the whitelisting code would be to normalize the innocent hit counter by subtracting the spam hits and then compare against the whitelist threshold.
Something like this:

    if (CTX->flags & DSF_WHITELIST) {
      if (ds_term->key == whitelist_token &&
          ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
          (ds_term->s.innocent_hits - ds_term->s.spam_hits) > CTX->wh_threshold &&
          CTX->classification == DSR_NONE)
      {
        do_whitelist = 1;
      }
    }

> I think that logic works well enough for deciding that
> a sender can be presumed innocent, but it doesn't work
> very well for suggesting that the sender might in fact
> not be a good candidate for whitelisting.
>
So you got a from line that has 15 times more innocent hits than spam hits and that has an innocent hit count bigger than your specified whitelist threshold that you DO NOT consider worth whitelisting? Have you thought about increasing the whitelist threshold or about blocklisting that sender/domain?

IMHO a user and/or domain dependent whitelisting threshold would be a good way to work around such issues. Currently DSPAM does not have that (it only has one single whitelist threshold for every sender. Senders are way too easy to forge and using the whole From header line is IMHO a good trade-off).

----------------------------------------------------------------------

Comment By: Jens Finkhaeuser (unwesen)
Date: 2011-05-14 10:11

Message:
Did anyone have a chance to look at this?

----------------------------------------------------------------------

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel