On Jun 25, 2010, at 8:20 AM, Ben Luey wrote:

> So I changed the tokenizer to osb, but left the trainingmode at TEFT.
> Re-ran a big spam corpus and some ham and things started working
> better.
> I think the main problem was my confusion over the X-DSPAM-Probability
> header. It appears that X-DSPAM-Probability is either 0.00000 (not spam)
> or 1.00000 (spam). So it isn't really a probability but a binary
> spam/notspam.

I'm too new to dspam to offer suggestions, but I can say that this
assumption is wrong. Here are the dspam headers from a sample spam email.

##
X-Dspam-Result: Spam
X-Dspam-Processed: Thu Jun 24 22:24:30 2010
X-Dspam-Confidence: 0.4884
X-Dspam-Improbability: 1 in 96 chance of being ham
X-Dspam-Probability: 0.9113
X-Dspam-Signature: 11,4c243d8e1239212984381
X-Dspam-Factors: 15,
 Received*Thu+#+#+#+22, 0.01000,
 Received*triband+mum, 0.84064,
 Received*triband+#+#+(triband, 0.84064,
 Received*from+#+#+#+(triband, 0.84064,
 Received*mum+#+(triband, 0.84064,
 Received*from+triband, 0.84064,
 Received*triband+#+#+#+mum, 0.84064,
 Received*with+SMTP, 0.16312,
 Received*SMTP+id, 0.17381,
 Received*(triband+mum, 0.82486,
 Received*(Postfix)+#+SMTP, 0.19235,
 Received*pixilla.com>+Thu, 0.20739,
 Received*mum+#+#+mum, 0.79026,
 Received*from+#+mum, 0.79026,
 Received*by+#+#+#+SMTP, 0.21654
##

And here is a sample ham email. Every ham email I have looked at has
"X-Dspam-Probability: 0.0000".

##
X-Dspam-Result: Innocent
X-Dspam-Processed: Wed Jun 23 14:02:29 2010
X-Dspam-Confidence: 0.5156
X-Dspam-Improbability: 1 in 107 chance of being spam
X-Dspam-Probability: 0.0000
X-Dspam-Signature: 11,4c2276651232104920670
X-Dspam-Factors: 27,
 Received*for+#+#+#+23, 0.99000,
 Received*for+#+#+#+23, 0.99000,
 Received*Wed+23, 0.99000,
 Received*Wed+23, 0.99000,
 https+#+https, 0.01000,
 https+#+https, 0.01000,
 so+#+#+are, 0.01000,
 Date*Wed+23, 0.99000,
 Content-Type*multipart/alternative+#+#+1, 0.01000,
 Date*23+Jun, 0.99000,
 Received*23+Jun, 0.99000,
 Received*23+Jun, 0.99000,
 20+https, 0.01000,
 CLIENTS+WITH, 0.01000,
 popular+#+#+#+the, 0.01000,
 we+are, 0.01000,
 Date*54+0500, 0.01000,
 20+#+#+https, 0.01000,
 were+so, 0.01000,
 //+#+//, 0.01000,
 //+#+//, 0.01000,
 Date*23+#+2010, 0.99000,
 Received*23+#+2010, 0.99000,
 Received*23+#+2010, 0.99000,
 so+#+we, 0.01000,
 Phone+#+#+#+Fax, 0.01000,
 Phone+#+#+#+Fax, 0.01000
##

> When I kept seeing headers with 0.00000, I thought dspam
> was way off as there should be
> at least some chance an email was spam.
> In reality, X-DSPAM-Confidence is the metric to look at to see how
> 'close' dspam was on a false positive / negative, while probability
> doesn't contain any additional information.
>
> Ben
>
> Ben Luey wrote:
>> Just wanted to check in again. Every incoming email gets
>>
>> X-DSPAM-Probability: 0.0000
>>
>> I'm not a statistician, but this can't be right. I've trained Dspam
>> (3.9.0) on hundreds of spam / not spam from the SA publiccorpus and a
>> spam-free folder. Every time I get a spam message I retrain the filter.
>> But still, even blatant spam gets X-DSPAM-Probability: 0.0000. The
>> X-DSPAM-Confidence varies from 50% to 100% where the lower the
>> confidence, the more likely it is spam.
>>
>> This can't be normal -- is dspam in some training mode or something?
>> Also, I turned on show factors in my configuration; in case this is
>> helpful, below are the factors of a 53% confidence, 0% probability
>> blatant spam message I got:
>>
>> X-Original-To*vescentphotonics.com, 0.00313,
>> Received*vescentphotonics.com>, 0.00447,
>> Received*2010+13, 0.00479,
>> Received*2010+13, 0.00479,
>> Received*by+mail.vescentphotonics.com, 0.00533,
>> Received*mail.vescentphotonics.com, 0.00533,
>> Received*mail.vescentphotonics.com+(Postfix), 0.00534,
>> X-Original-To*bugreporter, 0.00691,
>> Date*2010, 0.00938,
>> Received*for+<bugreporter, 0.00944,
>> Received*<bugreporter, 0.00944,
>> Received*2010, 0.00999,
>> Received*2010, 0.00999,
>> Content-Type*1251", 0.99000,
>> X-Greylist*45, 0.01000,
>> DEAR, 0.99000,
>> X-MimeOLE*MimeOLE+V6.00.2600.0000, 0.99000,
>> aside, 0.99000,
>> X-Mailer*Express+6.00.2600.0000, 0.99000,
>> operation+to, 0.99000,
>> the+deceased, 0.99000,
>> consent, 0.99000,
>> await+your, 0.99000,
>> set+aside, 0.99000,
>> Date*2010+13, 0.01000,
>> this+transaction, 0.99000,
>> this+transaction, 0.99000
>>
>> Thanks,
>>
>> Ben
>>
>> ------------------------------------------------------------------------------
>> ThinkGeek and WIRED's
>> GeekDad team up for the Ultimate
>> GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
>> lucky parental unit. See the prize list and enter to win:
>> http://p.sf.net/sfu/thinkgeek-promo
>> _______________________________________________
>> Dspam-user mailing list
>> Dspam-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dspam-user
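
For anyone who wants to compare these headers across a batch of messages rather than by eye, here is a minimal sketch in Python. It rests on assumptions drawn only from the samples in this thread: that X-DSPAM-Factors is a count followed by alternating token/value pairs, and that tokens contain no commas. The function names and the 0.5 split between "spammy" and "hammy" tokens are mine, purely for illustration, and the sample message is trimmed to three of the fifteen factors from the spam example, with the count adjusted to match.

```python
from email import message_from_string

# Trimmed sample: three factors taken from the spam example in this thread.
# The leading count was changed from 15 to 3 to match the trimmed list.
SAMPLE = """\
X-Dspam-Result: Spam
X-Dspam-Confidence: 0.4884
X-Dspam-Probability: 0.9113
X-Dspam-Factors: 3, Received*Thu+#+#+#+22, 0.01000,
 Received*triband+mum, 0.84064, Received*with+SMTP, 0.16312

(body omitted)
"""


def parse_factors(value):
    """Split a factors header value into (count, [(token, value), ...]).

    Assumes the layout seen in this thread: "N, token, value, token, value".
    """
    parts = [p.strip() for p in value.split(",")]
    count = int(parts[0])
    rest = parts[1:]
    pairs = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return count, pairs


def summarize(raw_message):
    """Collect the dspam verdict headers and split factors at 0.5."""
    msg = message_from_string(raw_message)  # header lookup is case-insensitive
    count, factors = parse_factors(msg["X-Dspam-Factors"])
    return {
        "result": msg["X-Dspam-Result"],
        "probability": float(msg["X-Dspam-Probability"]),
        "confidence": float(msg["X-Dspam-Confidence"]),
        "declared_factors": count,
        "spammy": [tok for tok, p in factors if p >= 0.5],  # illustrative cutoff
        "hammy": [tok for tok, p in factors if p < 0.5],
    }


if __name__ == "__main__":
    for key, val in summarize(SAMPLE).items():
        print(f"{key}: {val}")
```

Running `summarize()` over a folder of saved messages and looking at probability next to confidence should show fairly quickly whether probability really only takes the two extreme values on a given install.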