On 26/04/11 11:46, ste...@bajic.ch wrote:
> On 23/04/11 12:24, Stevan Bajić wrote:
>>>> On Fri, 22 Apr 2011 10:17:17 +0200
>>>> Tom Hendrikx <t...@whyscream.net> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>> Hello Tom,
>>>>
>>>>> In my current setup I just received my first FP.
>>>>>
>>>> I hope it is an old setup? One FP is not a big thing.
>>>>
>>>>> Dspam is set up to add the dspam-factors header to classified
>>>>> e-mails, but after reviewing the data, I don't understand why dspam
>>>>> decided to classify the message as spam. Also the
>>>>> X-DSPAM-Improbability header has weird contents.
>>>>>
>>>>> Does the dspam_factors header contain all of the tokens used to
>>>>> classify the message, or only a subset of them?
>>>>>
>>>> All of them. But if you use more than one algorithm then only the
>>>> first one will be shown in X-DSPAM-Factors.
>>>>
>
> Ah I see. I use "graham burton" now. Would there be any change in
> classification if I changed that to "burton graham" in order to see more
> factors?
>
>> Yes. Try it. I don't think you can lose much by trying.
>
>>>>> Because the headers in the FP message do not explain why it happens:
>>>>>
>>>>> X-DSPAM-Result: Spam
>>>>> X-DSPAM-Processed: Fri Apr 22 01:01:29 2011
>>>>> X-DSPAM-Confidence: 0.9963
>>>>> X-DSPAM-Improbability: 1 in 26939 chance of being ham
>>>>> X-DSPAM-Probability: 1.0000
>>>>> X-DSPAM-Signature: 1,4db0b74991741873512032
>>>>> X-DSPAM-Factors: 15,
>>>>>   X-AntiAbuse*Original+#+-, 0.99649,
>>>>>   X-AntiAbuse*Caller+#+GID, 0.99649,
>>>>>   X-AntiAbuse*Sender+#+Domain, 0.99649,
>>>>>   X-AntiAbuse*please+#+it, 0.99649,
>>>>>   X-AntiAbuse*with+#+#+report, 0.99649,
>>>>>   X-AntiAbuse*to+#+abuse, 0.99649,
>>>>>   X-AntiAbuse*Primary+#+-, 0.99649,
>>>>>   X-AntiAbuse*Original+Domain, 0.99649,
>>>>>   X-AntiAbuse*GID+-, 0.99649,
>>>>>   X-AntiAbuse*Sender+#+#+-, 0.99649,
>>>>>   X-AntiAbuse*track+abuse, 0.99649,
>>>>>   X-AntiAbuse*header+was, 0.99649,
>>>>>   X-AntiAbuse*header+#+#+#+track, 0.99649,
>>>>>   X-AntiAbuse*was+#+to, 0.99649,
>>>>>   X-AntiAbuse*Originator+Caller, 0.99649
>>>>>
>>>>> According to the scoring of the listed tokens, I think this message
>>>>> should be marked as ham, not as spam.
>>>>>
>>>> I think you mix up things here.
>
> First thing I mixed up was that I was under the impression that a high
> 'score' in the token meant a 'low spamminess'. My bad, as it's the other
> way around.
>
>> Are you sure it is the other way around? Maybe I am having a memory
>> lapse right now, but as far as I remember the score behind the tokens
>> is related to 'X-DSPAM-Result'. So the result influences what the
>> scoring means. If the result is 'Innocent' then a high score means a
>> high innocent rating.
>
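For anyone who wants to compare factor lists between messages, the X-DSPAM-Factors value quoted above can be split into token/score pairs with a few lines of Python. This is only a minimal sketch: the function name `parse_factors` is made up for illustration, the layout (a count, then alternating token and score separated by commas) is inferred from the single header shown in this thread, and the sketch says nothing about whether a high score means spam or innocent.

```python
import re
from typing import List, Tuple


def parse_factors(header_value: str) -> Tuple[int, List[Tuple[str, float]]]:
    """Split an X-DSPAM-Factors value into (declared count, [(token, score), ...]).

    Assumed layout, inferred from the header quoted in this thread:
        "<count>, <token>, <score>, <token>, <score>, ..."
    """
    count_text, _, rest = header_value.partition(",")
    declared = int(count_text.strip())
    # Folded headers contain newlines and tabs; collapse them to single spaces.
    rest = " ".join(rest.split())
    # A factor is a token followed by ", <score>". Tokens may contain
    # '+', '#', '*' and '-', so match lazily up to the next score.
    pairs = re.findall(r"\s*(.+?),\s*([01]\.\d+)\s*(?:,|$)", rest)
    return declared, [(token.strip(), float(score)) for token, score in pairs]


if __name__ == "__main__":
    # Shortened excerpt of the header above; the count is adjusted to match.
    value = (
        "2,\n"
        "\tX-AntiAbuse*Original+#+-, 0.99649,\n"
        "\tX-AntiAbuse*track+abuse, 0.99649"
    )
    declared, factors = parse_factors(value)
    for token, score in factors:
        print(f"{score:.5f}  {token}")
```

Comparing the declared count with the number of parsed pairs is a cheap sanity check that the header was not truncated in transit.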
As far as I know, only the dspam-confidence header can be near 1.0 for
both spam and innocent. Reviewing some dspam-factors in whitelisted and
innocent mails shows 0.0100 scores for most tokens.

>
>>>> If the result is "Spam" then the shown tokens are spam tokens. See this
>>>> old CHANGELOG entry:
>>>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>>> [20040819.0800] jonz: added X-DSPAM-Factors
>>>>
>>>> added determining factors header to emails containing a list of tokens
>>>> that played a role in the decision. if multiple algorithms are defined,
>>>> only one is used. if the message is spam, the factor set from an
>>>> algorithm returning a spam result will be used.
>>>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>>>
>
> In debug output, there are many more tokens generated than 15 or 27 (I
> count 1350 tokens in my test).
>
>> Right. This is the tokenizer doing its work.
>
> How are the 15 or 27 tokens that are used for classification selected
> from the larger set?
>
>> All the generated tokens are sorted and then the most significant tokens
>> are used. Significant in this case means tokens with either a high
>> innocent or a high spam count. Graham takes only the 15 most significant
>> tokens and Burton takes the 27 most significant tokens. Burton does not
>> eliminate double entries, while Graham does (Graham uses the 15 most
>> significant unique (for that message) tokens).

okay that's clear :)

>> This is btw one of the biggest reasons why one should avoid TEFT. DSPAM
>> is all about learning. Right? So imagine you would as a kid have a
>> teacher that would FORCE you to read and learn the whole Algebra theory
>> before you would answer any mathematical question. Could you imagine
>> this? A teacher asking:
>> Teacher: Tom? Can you say what x is in this equation: 2x + 3 = 13
>> Tom: The answer is ... [teacher interrupts you here]
>> Teacher: For now I don't care about the result.
>> I want you, Tom, first to read this 400-page mathematical book about
>> Algebra, and afterwards I want to hear the result.
>> Tom: Sitting down for 8 hours, reading that 400-page book.
>> [fast forward 8 hours]
>> Tom: Teacher? The answer is 100.
>> Teacher: WRONG! Tom! You are wrong. Now go back and read that 400-page
>> book again.
>> Tom: Sitting down for 8 hours and reading that 400-page book.
>> [fast forward another 8 hours]
>> Tom: Teacher? I have read that book.
>> Teacher: Okay Tom. Now that you have LEARNED that 400-page Algebra book
>> I will give you the right answer.
>> Teacher: The correct answer is x = 5.
>
>> Now, the above example might sound logical to you, since you made an
>> error in computing x. But imagine what happens if you had given the
>> correct answer from the beginning. Right: your teacher would not force
>> you to read the whole Algebra book a second time. But he would still
>> FORCE you to read the 400-page book EACH time before/after (depending
>> on your viewpoint) you give the answer.
>
>> TOE for example would be more logical (logical in the sense that it
>> behaves much like humans do).
>
>> TOE wrong answer:
>> Teacher: Question
>> TOE: Wrong answer
>> Teacher: TOE! Learn/correct now!
>
>> TOE right answer:
>> Teacher: Question
>> TOE: Correct answer
>> Teacher: Good boy. Continue with next question.
>
>> TEFT on the other hand is different:
>
>> TEFT wrong answer:
>> Teacher: Question
>> TEFT: Need first to read/learn 400 pages of Algebra.
>> TEFT: Wrong answer
>> Teacher: TEFT! Learn/correct now!
>
>> TEFT right answer:
>> Teacher: Question
>> TEFT: Need first to read/learn 400 pages of Algebra.
>> TEFT: Correct answer
>> Teacher: Good boy. Continue with next question.
>
>> TEFT was a good way in the old days to speed up dull classifiers like
>> WORD and CHAIN. But I see you use OSB, and there I would strongly
>> suggest not using TEFT. Wipe all your data and start fresh with TOE.
>> I am pretty sure that you will not need many corrections to get to the
>> same accuracy level that you are at now with TEFT.
>

That's a long and clear explanation, thank you :) I always used TEFT up
until now, but it is a small scale setup (I'm the main user). The '8-hour
learning time' (computation time) is no bottleneck here, but I'll see
what happens if I start fresh with OSB+TOE.

>
> Most spammy (for spam-classified message) or innocent (for
> innocent-classified message) tokens?
>
>> Yes.
>
> Hmm maybe I should read more about these algorithms... :)
>
>> The article/wiki entry from Julien should help you understand the
>> formulas.
>

That would be a nice way to spend a warm summer evening somewhere this
week :)

--
Regards,
Tom

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
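The selection step described in this thread (sort all generated tokens, then keep the 15 most significant unique ones for Graham or the 27 most significant ones, duplicates included, for Burton) could be sketched roughly as below. This is not DSPAM's code. The thread only says that significant tokens are those with a high innocent or spam count; measuring significance as the distance of a token's probability from 0.5 is an assumption borrowed from Paul Graham's "A Plan for Spam", and all function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, float]  # (token text, spam probability between 0.0 and 1.0)


def select_significant(tokens: List[Token], limit: int, unique: bool) -> List[Token]:
    """Return up to `limit` tokens whose probability lies furthest from 0.5.

    Assumption: "significance" is abs(probability - 0.5). With unique=True,
    a token text that occurs several times in the message is kept only once.
    """
    ranked = sorted(tokens, key=lambda t: abs(t[1] - 0.5), reverse=True)
    if not unique:
        return ranked[:limit]
    seen = set()
    picked: List[Token] = []
    for name, prob in ranked:
        if name in seen:
            continue  # already used this token once
        seen.add(name)
        picked.append((name, prob))
        if len(picked) == limit:
            break
    return picked


def graham_window(tokens: List[Token]) -> List[Token]:
    # 15 most significant tokens, duplicates removed (as described in the thread).
    return select_significant(tokens, limit=15, unique=True)


def burton_window(tokens: List[Token]) -> List[Token]:
    # 27 most significant tokens, duplicates allowed (as described in the thread).
    return select_significant(tokens, limit=27, unique=False)
```

With a message in which one strongly spammy token repeats many times, `burton_window` can fill all 27 slots with that single token, while `graham_window` counts it once and moves on to the next most significant token. That difference is why swapping the algorithm order can change which factors show up in the header.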