> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 23/04/11 12:24, Stevan Bajić wrote:
>> On Fri, 22 Apr 2011 10:17:17 +0200
>> Tom Hendrikx <t...@whyscream.net> wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Hi,
>>>
>> Hello Tom,
>>
>>
>>> In my current setup I just received my first FP.
>>>
>> I hope it is an old setup? One FP is not a big thing.
>>
>>> Dspam is set up to add
>>> the dspam-factors header to classified e-mails, but after reviewing the
>>> data, I don't understand why dspam decided to classify the message as
>>> spam. Also the X-DSPAM-Improbability header has weird contents.
>>>
>>> Does the dspam_factors header contain all of the tokens used to
>>> classify the message, or only a subset of them?
>>>
>> All of them. But if you use more than one algorithm then only the first
>> one will be shown in X-DSPAM-Factors.
>>
>
> Ah I see. I use "graham burton" now. Would there be any change in
> classification if I changed that to "burton graham" in order to see more
> factors?
>
Yes. Try it. I don't think you can lose much by trying.

>>> Because the headers in the FP
>>> message do not explain why it happens:
>>>
>>> X-DSPAM-Result: Spam
>>> X-DSPAM-Processed: Fri Apr 22 01:01:29 2011
>>> X-DSPAM-Confidence: 0.9963
>>> X-DSPAM-Improbability: 1 in 26939 chance of being ham
>>> X-DSPAM-Probability: 1.0000
>>> X-DSPAM-Signature: 1,4db0b74991741873512032
>>> X-DSPAM-Factors: 15,
>>>     X-AntiAbuse*Original+#+-, 0.99649,
>>>     X-AntiAbuse*Caller+#+GID, 0.99649,
>>>     X-AntiAbuse*Sender+#+Domain, 0.99649,
>>>     X-AntiAbuse*please+#+it, 0.99649,
>>>     X-AntiAbuse*with+#+#+report, 0.99649,
>>>     X-AntiAbuse*to+#+abuse, 0.99649,
>>>     X-AntiAbuse*Primary+#+-, 0.99649,
>>>     X-AntiAbuse*Original+Domain, 0.99649,
>>>     X-AntiAbuse*GID+-, 0.99649,
>>>     X-AntiAbuse*Sender+#+#+-, 0.99649,
>>>     X-AntiAbuse*track+abuse, 0.99649,
>>>     X-AntiAbuse*header+was, 0.99649,
>>>     X-AntiAbuse*header+#+#+#+track, 0.99649,
>>>     X-AntiAbuse*was+#+to, 0.99649,
>>>     X-AntiAbuse*Originator+Caller, 0.99649
>>>
>>> According to the scoring of the listed tokens, I think this message
>>> should be marked as ham, not as spam.
>>>
>> I think you mix up things here.
>
> First thing I mixed up was that I was under the impression that a high
> 'score' in the token meant a 'low spamminess'. My bad, as it's the other
> way around.
>
Are you sure it is the other way around? Maybe I am having a memory lapse
right now, but as far as I remember the score behind the tokens is related
to 'X-DSPAM-Result'. So the result influences what the score means. If the
result is 'Innocent' then a high score means a high innocent rating.

>> If the result is "Spam" then the shown tokens are spam tokens. See this
>> old CHANGELOG entry:
>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>> [20040819.0800] jonz: added X-DSPAM-Factors
>>
>> added determining factors header to emails containing a list of tokens
>> that played a role in the decision. if multiple algorithms are defined,
>> only one is used.
>> if the message is spam, the factor set from an algorithm returning
>> a spam result will be used.
>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>
>
> In debug output, there are many more tokens generated than 15 or 27 (I
> count 1350 tokens in my test).
>
Right. This is the tokenizer doing its work.

> How are the 15 or 27 tokens that are used
> for classification selected from the larger set?
>
All the generated tokens are sorted and then the most significant tokens
are used. Significant in this case means tokens with either a high
innocent count or a high spam count. Graham takes only the 15 most
significant tokens and Burton takes the 27 most significant tokens. Burton
does not eliminate duplicate entries, while Graham does (Graham uses the
15 most significant tokens that are unique for that message).

This is, by the way, one of the biggest reasons why one should avoid TEFT.
DSPAM is all about learning, right? So imagine that as a kid you had a
teacher who FORCED you to read and learn the whole theory of algebra
before you could answer any mathematical question. Could you imagine this?
A teacher asking:

Teacher: Tom? Can you say what x is in this equation: 2x + 3 = 13
Tom: The answer is ... [teacher interrupts you here]
Teacher: For now I don't care about the result. First I want you, Tom, to
         read this 400-page mathematical book about algebra, and
         afterwards I want to hear the result.
Tom: Sits down for 8 hours reading that 400-page book.
[fast forward 8 hours]
Tom: Teacher? The answer is 100.
Teacher: WRONG! Tom! You are wrong. Now go back and read that 400-page
         book again.
Tom: Sits down for 8 hours reading that 400-page book.
[fast forward another 8 hours]
Tom: Teacher? I have read that book.
Teacher: Okay Tom. Now that you have LEARNED that 400-page algebra book I
         will give you the right answer.
Teacher: The correct answer is x = 5.

Now, the above example might sound logical to you, since you made an
error in computing x.
Imagine what would have happened if you had given the correct answer from
the beginning. Right: your teacher would not force you to read the whole
algebra book a second time. But he would still FORCE you to read the
400-page book EACH time before/after (depending on your viewpoint) you
give the answer.

TOE, for example, would be more logical (logical in the sense that it
behaves much like humans do).

TOE wrong answer:
Teacher: Question
TOE: Wrong answer
Teacher: TOE! Learn/correct now!

TOE right answer:
Teacher: Question
TOE: Correct answer
Teacher: Good boy. Continue with next question.

TEFT on the other hand is different:

TEFT wrong answer:
Teacher: Question
TEFT: Need first to read/learn 400 pages of Algebra.
TEFT: Wrong answer
Teacher: TEFT! Learn/correct now!

TEFT right answer:
Teacher: Question
TEFT: Need first to read/learn 400 pages of Algebra.
TEFT: Correct answer
Teacher: Good boy. Continue with next question.

TEFT was a good way in the old days to speed up dull classifiers like WORD
and CHAIN. But I see you use OSB, and there I would strongly suggest not
using TEFT. Wipe all your data and start fresh with TOE. I am pretty sure
that you will not need many corrections to get to the same accuracy level
that you are at now with TEFT.

> Most spammy (for
> spam-classified message) or innocent (for innocent-classified message)
> tokens?
>
Yes.

> Hmm maybe I should read more about these algorithms... :)
>
The article/wiki entry from Julien should help you understand the
formulas.

>
>>
>>> Relevant values from dspam.conf:
>>
>>> TrainingMode teft
>>> ImprobabilityDrive on
>>> Algorithm graham burton
>>>
>> The 15 factors you see in your mail are the ones from Graham. Burton
>> would produce 27.
>>
>>
>>> Tokenizer osb
>>> PValue bcr
>>>
>>> All of the above with a git tip checkout from 2011-03-01.
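
To make the selection step described above a bit more concrete, here is a
minimal Python sketch. It is not DSPAM's code. The function names are
hypothetical, and using the distance of a token's probability from 0.5 as
the measure of "significant" is an assumption made for illustration. Only
the limits (15 unique tokens for Graham, 27 tokens with duplicates allowed
for Burton) come from the explanation in this mail.

```python
def select_factors(tokens, limit, unique):
    """Pick the `limit` most significant tokens.

    tokens: list of (name, spam_probability) pairs in message order;
            the same name may occur more than once.
    """
    if unique:
        # Graham-style: keep only the first occurrence of each token.
        first_seen = {}
        for name, p in tokens:
            first_seen.setdefault(name, p)
        tokens = list(first_seen.items())
    # Assumption: "significant" means far away from the neutral value 0.5.
    ranked = sorted(tokens, key=lambda t: abs(t[1] - 0.5), reverse=True)
    return ranked[:limit]


def graham_factors(tokens):
    # 15 most significant tokens, duplicates removed.
    return select_factors(tokens, limit=15, unique=True)


def burton_factors(tokens):
    # 27 most significant tokens, duplicates kept.
    return select_factors(tokens, limit=27, unique=False)


if __name__ == "__main__":
    # One strong token seen three times, plus 39 weaker filler tokens.
    sample = [("track+abuse", 0.99649)] * 3
    sample += [("filler%d" % i, 0.5 + i / 100.0) for i in range(1, 40)]
    print(len(graham_factors(sample)), len(burton_factors(sample)))
```

With the sample data, the repeated token fills one slot in the Graham set
but three slots in the Burton set, which is the difference in duplicate
handling mentioned above.
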
>>>
>>> Kind regards,
>>>
>>> Tom
>>>
>
>
> - --
>
> New PGP key: 7D54EFF5
> Fingerprint: C26F 374F 5E13 157B 5B42 7A1B 93DF 319D 7D54 EFF5
> http://www.whyscream.net/key-transition-2011-03-30.txt.asc
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iQIcBAEBAgAGBQJNtnZTAAoJEJPfMZ19VO/1wgcQAJyEjpd6EPapEQFWpLSxijPT
> ibn3LyYfotY3dTLbWqgRzcI5+WkIXau8xjb63ZnfLckW/pzoKsXGbTMQcoik7N+Z
> qc17jvjxRIX0m4bA7gO5yZaywADpO08YmsO+oUTO1Juv+rVXxJaokQrQKgYEa6Ui
> ZQC/pAJora0vL5flfaOPZZ6fkb/J60VHQBCSRRUzM+b3MEdoQnbnBXLi1VlJDs04
> hTyUI4LT7xFaEGS8KrYBYzRp/ioQ88VJwpCU9WFcndLjpwBqtbVQujRQxLFROy/z
> C8kxyTBVbhmcs538D0AFobRMqU7vmviflYEfdbIsI4r0nqxxI+ww8z61D37axylD
> QkaCNAX+mGI4b8QO6451pkHq0lM27YKRWkGAh0LkB+8wQ4VlC0W84Ygt1/BiFNMp
> kQhcwVAdjQsd2KYNdH3PPCPbOKh5D3o2psvKA2N7EwXxRO48O6Y9fzeEPL9qYLBf
> hiOQn5stBo1I8KQs17XhaeRWvJBFd8xPlcGaw+qaimJJFbCeQRjaxC34xb9GF805
> k0+R+irexgesaYFQZ1fgKjRTrcVWJZx8+C9HnlNu4R2u+93NtE5of4EJGTutPZvy
> ouTBS8X7pkk6TZHDc7rR+j8KOOc/nkObKnF1Li18F32dZG3g2DbSiT0+Gnfek156
> KMNmLBVC+tSGbh/CFEkC
> =ymjm
> -----END PGP SIGNATURE-----
>
_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user