So I changed the tokenizer to osb but left the training mode at TEFT, re-ran a big spam corpus and some ham, and things started working better. I think the main problem was my confusion over the X-DSPAM-Probability header. It appears that X-DSPAM-Probability is always either 0.00000 (not spam) or 1.00000 (spam), so it isn't really a probability but a binary spam/not-spam flag. When I kept seeing headers with 0.00000, I thought dspam was way off, as there should be at least some chance that an email was spam. In reality, X-DSPAM-Confidence is the metric to look at to see how 'close' dspam came to a false positive or false negative; the probability header doesn't contain any additional information.
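For anyone scripting around these headers, here is a rough Python sketch of the reading described above: treat X-DSPAM-Probability as a yes/no flag and use X-DSPAM-Confidence as the number worth watching. The function name `dspam_verdict`, the 0.5 cut-off, and the sample message are made up for illustration, and the code assumes both headers hold plain decimal numbers; check that against what your own dspam install actually writes.

```python
from email import message_from_string


def dspam_verdict(raw_message):
    """Return (is_spam, confidence) from the X-DSPAM-* headers, or None if absent."""
    msg = message_from_string(raw_message)
    prob = msg.get("X-DSPAM-Probability")
    conf = msg.get("X-DSPAM-Confidence")
    if prob is None or conf is None:
        return None
    # Assumption: both header values are plain decimals such as "0.0000".
    # The probability is treated as a binary flag; 0.5 is an arbitrary cut-off.
    return float(prob) >= 0.5, float(conf)


# Hypothetical message modelled on the 53% confidence / 0% probability
# spam described in the quoted post; the exact value format is assumed.
sample = (
    "X-DSPAM-Probability: 0.0000\n"
    "X-DSPAM-Confidence: 0.5300\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(dspam_verdict(sample))
```

A message that comes back as not spam with a confidence near the low end is the kind worth checking by hand and retraining on.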
Ben

Ben Luey wrote:
> Just wanted to check in again. Every incoming email gets
>
> X-DSPAM-Probability: 0.0000
>
> I'm not a statistician, but this can't be right. I've trained Dspam
> (3.9.0) on hundreds of spam / not spam from the SA publiccorpus and a
> spam-free folder. Every time I get a spam message I retrain the filter.
> But still, even blatant spam gets X-DSPAM-Probability: 0.0000. The
> X-DSPAM-Confidence varies from 50% to 100% where the lower the
> confidence, the more likely it is spam.
>
> This can't be normal -- is dspam in some training mode or something?
> Also, I turned on show factors in my configuration, in case this is
> helpful, below are the factors of a 53% confidence, 0% probability
> blatant spam message I got:
>
> X-Original-To*vescentphotonics.com, 0.00313,
> Received*vescentphotonics.com>, 0.00447, Received*2010+13,
> 0.00479, Received*2010+13, 0.00479,
> Received*by+mail.vescentphotonics.com, 0.00533,
> Received*mail.vescentphotonics.com, 0.00533,
> Received*mail.vescentphotonics.com+(Postfix), 0.00534,
> X-Original-To*bugreporter, 0.00691, Date*2010, 0.00938,
> Received*for+<bugreporter, 0.00944, Received*<bugreporter,
> 0.00944, Received*2010, 0.00999, Received*2010, 0.00999,
> Content-Type*1251", 0.99000, X-Greylist*45, 0.01000, DEAR,
> 0.99000, X-MimeOLE*MimeOLE+V6.00.2600.0000, 0.99000, aside,
> 0.99000, X-Mailer*Express+6.00.2600.0000, 0.99000, operation+to,
> 0.99000, the+deceased, 0.99000, consent, 0.99000, await+your,
> 0.99000, set+aside, 0.99000, Date*2010+13, 0.01000,
> this+transaction, 0.99000, this+transaction, 0.99000
>
> Thanks,
>
> Ben

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user