Re: [Dspam-user] Understanding classification: dspam factors?

Tom Hendrikx Fri, 22 Apr 2011 03:06:58 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,


Some of the answers to your questions were already in my e-mail:

- - version: git tip from 2011-03-01, commit
f02393585adca32778a176cfdf57e3bdef7b9496 according to git log
- - postfix passes mail to dspam daemon over lmtp
- - dspam is set up for a single shared group, of which I am currently
the only user/trainer
- - I train very accurately, and as I am currently the only user, I see
all messages that are retrained. Training is done only with the
dovecot-antispam plugin for correcting FP/FN, no corpus or inoculation
is being used. Statistics for the shared group:

global:
                TP True Positives:                    39
                TN True Negatives:                  5062
                FP False Positives:                    1
                FN False Negatives:                   33
                SC Spam Corpusfed:                     0
                NC Nonspam Corpusfed:                  0
                TL Training Left:                      0
                SHR Spam Hit Rate:                54.17%
                HSR Ham Strike Rate:               0.02%
                PPV Positive predictive value:    97.50%
                OCA Overall Accuracy:             99.34%
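
For readers who want to check the percentages, here is a minimal sketch that recomputes them from the four raw counters. The formulas are the standard definitions (hit rate, false positive rate, precision, accuracy) and are an assumption about what dspam_stats does internally; the function name `rates` is made up for illustration. They do reproduce the figures reported above.

```python
def rates(tp, tn, fp, fn):
    """Return (SHR, HSR, PPV, OCA) as percentages rounded to two decimals.

    Assumed definitions (not taken from the DSPAM source):
      SHR = TP / (TP + FN)    share of spam that was caught
      HSR = FP / (TN + FP)    share of ham that was wrongly flagged
      PPV = TP / (TP + FP)    share of spam verdicts that were correct
      OCA = (TP + TN) / all   share of all verdicts that were correct
    """
    shr = 100.0 * tp / (tp + fn)
    hsr = 100.0 * fp / (tn + fp)
    ppv = 100.0 * tp / (tp + fp)
    oca = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return tuple(round(x, 2) for x in (shr, hsr, ppv, oca))


# Counters from the shared group statistics above.
print(rates(tp=39, tn=5062, fp=1, fn=33))  # (54.17, 0.02, 97.5, 99.34)
```

The low SHR comes from the 33 false negatives set against only 39 true positives, while the single false positive barely moves the HSR.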

I think my training policy is OK. I don't have a long list of
IgnoreHeaders, but that does not matter to my question at all.

However none of this answers my initial question: does dspam_factors
represent all data used for classification? And if it does: why would
dspam ever decide that the example message was spam (with an astounding
confidence)?
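
As a way to sanity-check the numbers in the quoted X-DSPAM-Factors header, the sketch below parses such a header value and combines the listed values with Graham's naive Bayes formula from "A Plan for Spam". It rests on two assumptions: that each number is the token's spam probability, and that a plain Graham combination is a fair stand-in for what "Algorithm graham burton" with "PValue bcr" computes. It is not DSPAM's actual code, the function names are hypothetical, and it does not settle whether the header lists every token used.

```python
import math


def parse_factors(header_value):
    """Split an X-DSPAM-Factors value into (token, value) pairs.

    Assumes the layout seen in the quoted header: a count, then
    alternating token and value fields separated by commas, with no
    commas inside the tokens themselves.
    """
    parts = [p.strip() for p in header_value.split(",")]
    count, rest = int(parts[0]), parts[1:]
    if len(rest) != 2 * count:
        raise ValueError("field count does not match the announced count")
    return [(rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)]


def graham_combine(probs):
    """Graham's combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Two of the fifteen factors, copied from the quoted header.
sample = "2, X-AntiAbuse*track+abuse, 0.99649, X-AntiAbuse*header+was, 0.99649"
for token, value in parse_factors(sample):
    print(token, value)

# Fifteen factors, all with the value 0.99649 as in the quoted header.
print(f"{graham_combine([0.99649] * 15):.4f}")
```

Under those assumptions, fifteen values of 0.99649 combine to a result that rounds to 1.0000, the same figure as the X-DSPAM-Probability header.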

- --
Tom

On 22/04/11 10:36, Ibrahim Harrani wrote:
> Hi Tom,
> 
> Which dspam version are you using? How do you train? Which tokenizer
> do you use during and after training?
> Dspam is very sensitive to training. If you train poorly or train too
> much, you may run into trouble.
> Also there are many headers you should ignore. You can get the list from:
> http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC
> 
> Also, if you uploaded a spam/ham corpus from Windows to Unix/Linux, you
> should strip the carriage returns (^M) by adding the following line to
> dspam.conf. I had this problem before: in that case dspam was only
> checking the headers for the classification.
> 
> # Specifying 'lineStripping' causes DSPAM to strip ^M's from messages
> # passed in.
> Broken lineStripping
> 
> If you have the same problem, you may have to retrain your dspam data.
> 
> Thanks.
> 
> On Fri, Apr 22, 2011 at 9:17 AM, Tom Hendrikx <t...@whyscream.net> wrote:
> Hi,
> 
> In my current setup I just received my first FP. Dspam is set up to add
> the dspam-factors header to classified e-mails, but after reviewing the
> data, I don't understand why dspam decided to classify the message as
> spam. Also the X-DSPAM-Improbability header has weird contents.
> 
> Does the dspam_factors header contain all of the tokens used to classify
> the message, or only a subset of them? Because the headers in the FP
> message do not explain why this happened:
> 
> X-DSPAM-Result: Spam
> X-DSPAM-Processed: Fri Apr 22 01:01:29 2011
> X-DSPAM-Confidence: 0.9963
> X-DSPAM-Improbability: 1 in 26939 chance of being ham
> X-DSPAM-Probability: 1.0000
> X-DSPAM-Signature: 1,4db0b74991741873512032
> X-DSPAM-Factors: 15,
>        X-AntiAbuse*Original+#+-, 0.99649,
>        X-AntiAbuse*Caller+#+GID, 0.99649,
>        X-AntiAbuse*Sender+#+Domain, 0.99649,
>        X-AntiAbuse*please+#+it, 0.99649,
>        X-AntiAbuse*with+#+#+report, 0.99649,
>        X-AntiAbuse*to+#+abuse, 0.99649,
>        X-AntiAbuse*Primary+#+-, 0.99649,
>        X-AntiAbuse*Original+Domain, 0.99649,
>        X-AntiAbuse*GID+-, 0.99649,
>        X-AntiAbuse*Sender+#+#+-, 0.99649,
>        X-AntiAbuse*track+abuse, 0.99649,
>        X-AntiAbuse*header+was, 0.99649,
>        X-AntiAbuse*header+#+#+#+track, 0.99649,
>        X-AntiAbuse*was+#+to, 0.99649,
>        X-AntiAbuse*Originator+Caller, 0.99649
> 
> According to the scoring of the listed tokens, I think this message
> should be marked as ham, not as spam. Relevant values from dspam.conf:
> 
> TrainingMode teft
> ImprobabilityDrive on
> Algorithm graham burton
> Tokenizer osb
> PValue bcr
> 
> All of the above with a git tip checkout from 2011-03-01.
> 
> Kind regards,
> 
>        Tom
> 
> 
> FWIW: I added the X-AntiAbuse header to the IgnoreHeaders after
> reviewing this message, because I concluded that the header is pretty
> useless for classification.
> 
> 
------------------------------------------------------------------------------
Fulfilling the Lean Software Promise
Lean software platforms are now widely adopted and the benefits have been
demonstrated beyond question. Learn why your peers are replacing JEE
containers with lightweight application servers - and what you can gain
from the move. http://p.sf.net/sfu/vmware-sfemails
_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user



- -- 

New PGP key: 7D54EFF5
Fingerprint: C26F 374F 5E13 157B 5B42  7A1B 93DF 319D 7D54 EFF5
http://www.whyscream.net/key-transition-2011-03-30.txt.asc
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJNsVD4AAoJEJPfMZ19VO/1sCkP/RGJZwVZ9gIOFiYTR1sKfV4q
tvDl8L/oOjS13oYc7fvt7YNioceVEGe4MgWE/dWeverrttDO7kVxOWFqbUmaPUz7
9OlLfRpXQWZmV7XwtxFJ+Gk52sOux4By0G/y0BwTl2OlOdpbyzL/aOkH/2rCEwLH
UhPDTlEcIhMmAggVWOoF5esGYkIjjOZ2cp7UeyFHqTDRjZvkl9PX3xTCKwdePnW3
9x+1GyhNd/bl+nVY5xuqqqSMcb4qeyFtJ8Nn7bRgyKzB8PYgRmVU+bPXHOna7OIo
dG/74SkbIXBcTVZSYbZYFIzw9RzWaKxhBDcE09JzjsQoYanSzkzrDIVl290iCbXY
samHB1XhRFgsnnYpMsxECR7QzeqEvdLnhmgtPzZOSLFjzgGjeIQRkIy8oZOtgCt5
jzrgwby/eEl6XggiuJ/gXIBXJmmM23dxbwwaLjgkvZ7iIu2SVGYKGfcW1Xn31RkJ
k9VmaUQ4WJGfQd8q7pYBNR52M7nQxvMV+0BUim/C8Eu8zXgtf+FV6bCixWmixxZ5
cgSs59mu0TLZWq48IdWlWstNBMYzfLfO0DUSWdKdO1JdgAy4CvGmlYrqBGVaFPrF
Z26Era6cPo+t1ChrvowUPoIwKoyHXf/h/dtrqDnlwk3aD7Gy0fYt4JjaFmUBA+1k
Pp4iIHtrkG+PMv0DJnZX
=5Hso
-----END PGP SIGNATURE-----

