Bug Tracker item #3142744, was opened at 2010-12-23 11:31
Message generated for change (Tracker Item Submitted) made by unwesen
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1126467&aid=3142744&group_id=250683

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Jens Finkhaeuser (unwesen)
Assigned to: Nobody/Anonymous (nobody)
Summary: Undo whitelisting suggestion

Initial Comment:
As I mentioned in a different issue (that I just found while looking for this), 
it seems very hard to train dspam to *not* whitelist some sender. This all 
seems to boil down to this code here:

    if (CTX->flags & DSF_WHITELIST) {
      if (ds_term->key == whitelist_token              &&
          ds_term->s.spam_hits <= (ds_term->s.innocent_hits / 15) &&
          ds_term->s.innocent_hits > CTX->wh_threshold &&
          CTX->classification == DSR_NONE)
      {
        do_whitelist = 1;
      }
    }

Ca. line 930 in libdspam.c.

The whitelist_token appears to be calculated from the sender address (or From: 
line); so I understand the logic that if a sender is found, and it's got at 
least 15x as many innocent hits as spam hits, then whitelist the message 
(leaving out a few details here).
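
To make the effect of that condition concrete, here is a minimal standalone 
sketch of the two counter checks. The struct and function names are 
hypothetical stand-ins, not dspam's own types, and the token-key and 
classification checks are left out:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-token counters (not dspam's struct). */
struct token_stats {
    long spam_hits;
    long innocent_hits;
};

/* Mirrors only the two counter comparisons from the quoted condition.
 * wh_threshold is passed in as a plain long; its real type is an assumption. */
static bool would_whitelist(const struct token_stats *s, long wh_threshold)
{
    /* Integer division: innocent_hits / 15 truncates toward zero. */
    return s->spam_hits <= (s->innocent_hits / 15) &&
           s->innocent_hits > wh_threshold;
}
```

With an example threshold of 10, a sender with 1000 innocent hits stays 
whitelisted until it has collected 67 spam hits (1000 / 15 truncates to 66), 
which illustrates why retraining feels so slow.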

I think that logic works well enough for deciding that a sender can be presumed 
innocent, but it doesn't work very well for suggesting that the sender might in 
fact not be a good candidate for whitelisting. That logic seems to be in there 
because the whitelist_token's spam probability is hardcoded to 0.5 (in 
_ds_calc_stats). Wouldn't it make much more sense to calculate its probability 
properly, and use wh_threshold as a probability threshold, i.e. if the spam 
probability is below 0.3 or whatever, then whitelist it?

That way you can use the same probability calculation as for other terms and 
therefore train dspam, but still treat the whitelist token as special in that 
if it is trained to be ok, then the rest of the tokens get disregarded because 
the message is whitelisted.
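
As a rough illustration of the idea (not the attached patch), the check could 
look something like the sketch below. The probability estimate is a 
deliberately simplified spam fraction, not the formula dspam uses in 
_ds_calc_stats, and all names as well as the 0.3 cut-off are assumptions made 
for the example:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the sender token's counters. */
struct sender_stats {
    long spam_hits;
    long innocent_hits;
};

/* Simplified estimate, NOT dspam's formula: the fraction of hits that
 * were spam. Returns a neutral 0.5 when there is no data yet. */
static double sender_spam_probability(const struct sender_stats *s)
{
    long total = s->spam_hits + s->innocent_hits;
    if (total <= 0)
        return 0.5;
    return (double)s->spam_hits / (double)total;
}

/* Whitelist only while the sender's spam probability stays below the
 * configured cut-off (e.g. 0.3). */
static bool would_whitelist_by_probability(const struct sender_stats *s,
                                           double max_probability)
{
    return sender_spam_probability(s) < max_probability;
}
```

Under this scheme a sender with 1000 innocent hits and 500 spam hits 
(probability about 0.33) would no longer be whitelisted at a 0.3 cut-off, so 
training the sender as spam has a proportional effect.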

I've attached a patch that compiles, but is otherwise untested - mostly because 
I have no idea of what ramifications the change might have outside the code I 
touched. Also, it changes the meaning and format of the wh_threshold config 
variable, which is most likely *not* what you want. But it'll convey what I 
mean better than writing more text :)

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
