> On 26/04/11 11:46, ste...@bajic.ch wrote:
>> On 23/04/11 12:24, Stevan Bajić wrote:
>>>>> On Fri, 22 Apr 2011 10:17:17 +0200
>>>>> Tom Hendrikx <t...@whyscream.net> wrote:
>>>>>
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA1
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>> Hello Tom,
>>>>>
>>>>>
>>>>>> In my current setup I just received my first FP.
>>>>>>
>>>>> I hope it is an old setup? One FP is not a big thing.
>>>>>
>>>>>> Dspam is setup to add
>>>>>> the dspam-factors header to classified e-mails, but after reviewing
>>>>>> the
>>>>>> data, I don't understand why dspam decided to classify the message
>>>>>> as
>>>>>> spam. Also the X-DSPAM-Improbability header has weird contents.
>>>>>>
>>>>>> Does the dspam_factors header contain all of the tokens used to
>>>>>> classify
>>>>>> the message, or only a subset of them?
>>>>>>
>>>>> All of them. But if you use more than one algorithm, then only the
>>>>> first one will be shown in X-DSPAM-Factors.
>>>>>
>>
>> Ah I see. I use "graham burton" now. Would there be any change in
>> classification if I changed that to "burton graham" in order to see more
>> factors?
>>
>>> Yes. Try it. I don't think you can lose much by trying.
>>
>>
>>>>>> Because the headers in the FP
>>>>>> message do not explain why it happens:
>>>>>>
>>>>>> X-DSPAM-Result: Spam
>>>>>> X-DSPAM-Processed: Fri Apr 22 01:01:29 2011
>>>>>> X-DSPAM-Confidence: 0.9963
>>>>>> X-DSPAM-Improbability: 1 in 26939 chance of being ham
>>>>>> X-DSPAM-Probability: 1.0000
>>>>>> X-DSPAM-Signature: 1,4db0b74991741873512032
>>>>>> X-DSPAM-Factors: 15,
>>>>>>  X-AntiAbuse*Original+#+-, 0.99649,
>>>>>>  X-AntiAbuse*Caller+#+GID, 0.99649,
>>>>>>  X-AntiAbuse*Sender+#+Domain, 0.99649,
>>>>>>  X-AntiAbuse*please+#+it, 0.99649,
>>>>>>  X-AntiAbuse*with+#+#+report, 0.99649,
>>>>>>  X-AntiAbuse*to+#+abuse, 0.99649,
>>>>>>  X-AntiAbuse*Primary+#+-, 0.99649,
>>>>>>  X-AntiAbuse*Original+Domain, 0.99649,
>>>>>>  X-AntiAbuse*GID+-, 0.99649,
>>>>>>  X-AntiAbuse*Sender+#+#+-, 0.99649,
>>>>>>  X-AntiAbuse*track+abuse, 0.99649,
>>>>>>  X-AntiAbuse*header+was, 0.99649,
>>>>>>  X-AntiAbuse*header+#+#+#+track, 0.99649,
>>>>>>  X-AntiAbuse*was+#+to, 0.99649,
>>>>>>  X-AntiAbuse*Originator+Caller, 0.99649
>>>>>>
>>>>>> According to the scoring of the listed tokens, I think this message
>>>>>> should be marked as ham, not as spam.
>>>>>>
>>>>> I think you are mixing things up here.
>>
>> The first thing I mixed up was that I was under the impression that a
>> high 'score' on a token meant 'low spamminess'. My bad, as it's the
>> other way around.
>>
>>> Are you sure it is the other way around? Maybe I am having a memory
>>> lapse right now, but as far as I remember the score behind the tokens
>>> is related to 'X-DSPAM-Result'. So the result influences what the
>>> scoring means. If the result is 'Innocent' then a high score means a
>>> high innocent rating.
>>
>
> As far as I know, only the dspam-confidence header can be near-1.0 for
> both spam and innocent. Reviewing some dspam-factors in whitelisted and
> innocent mails shows 0.0100 scores for most tokens.
>
>>
>>>>> If the result is "Spam" then the shown tokens are spam tokens. See
>>>>> this old CHANGELOG entry:
>>>>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>>>> [20040819.0800] jonz: added X-DSPAM-Factors
>>>>>
>>>>> added determining factors header to emails containing a list of
>>>>> tokens that played a role in the decision. if multiple algorithms are
>>>>> defined, only one is used. if the message is spam, the factor set
>>>>> from an algorithm returning a spam result will be used.
>>>>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
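In pseudo-Python, the way I read that CHANGELOG entry (just a sketch of
the described behaviour, not the actual DSPAM source):

def pick_factor_set(results):
    # results: (algorithm_name, verdict, factors) tuples, one per
    # configured algorithm, in the order they are listed in dspam.conf
    if any(verdict == "spam" for _, verdict, _ in results):
        # overall result is spam: report the factors of the first
        # algorithm that returned a spam result
        for name, verdict, factors in results:
            if verdict == "spam":
                return name, factors
    # otherwise the first algorithm's factor set is used
    return results[0][0], results[0][2]

So with "graham burton" only one factor set ever shows up in
X-DSPAM-Factors, even though both algorithms ran.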
>>>>>
>>
>> In debug output, there are many more tokens generated than 15 or 27 (I
>> count 1350 tokens in my test).
>>
>>> Right. This is the tokenizer doing its work.
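To illustrate roughly what the OSB tokenizer does (only a sketch of the
idea in Python, not the actual DSPAM C code): a sliding window pairs each
word with the next few words, marking the skipped positions with '#', and
for header text the header name is prepended. That is why your factors
look like 'X-AntiAbuse*with+#+#+report'.

def osb_tokens(words, prefix="", window=5):
    # pair every word with each of the following (window - 1) words;
    # skipped positions in between are marked with '#'
    tokens = []
    for i in range(len(words)):
        for dist in range(1, window):
            if i + dist >= len(words):
                break
            middle = ["#"] * (dist - 1)
            tokens.append(prefix + "+".join([words[i]] + middle + [words[i + dist]]))
    return tokens

# hypothetical text, roughly what the X-AntiAbuse header body says
words = ("This header was added to track abuse "
         "please include it with any abuse report").split()
for t in osb_tokens(words, prefix="X-AntiAbuse*"):
    print(t)

A window like that quickly produces a lot of tokens per message, which is
why you see numbers like 1350 in your debug output.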
>>
>>
>> How are the 15 or 27 tokens that are used
>> for classification selected from the larger set?
>>
>>> All the generated tokens are sorted and then the most significant
>>> tokens are used. Significant means, in this case, tokens with either a
>>> high innocent or a high spam count. Graham takes only the 15 most
>>> significant tokens and Burton takes the 27 most significant tokens.
>>> Burton does not eliminate duplicate entries, while Graham does (Graham
>>> uses the 15 most significant unique tokens for that message).
>
> okay that's clear :)
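To make that a bit more concrete, here is a rough sketch of the selection
in Python (my reading of the idea, not the actual DSPAM source;
significance is taken here as the distance of a token's p-value from the
neutral 0.5):

def select_factors(token_pvalues, algorithm="graham"):
    # token_pvalues: (token_name, p_value) pairs for one message
    ranked = sorted(token_pvalues, key=lambda t: abs(t[1] - 0.5), reverse=True)
    if algorithm == "burton":
        # Burton: the 27 most significant entries, duplicates allowed
        return ranked[:27]
    # Graham: the 15 most significant *unique* tokens for this message
    factors, seen = [], set()
    for name, p in ranked:
        if name not in seen:
            seen.add(name)
            factors.append((name, p))
        if len(factors) == 15:
            break
    return factors

Whatever this returns is what ends up in X-DSPAM-Factors, no matter how
many tokens the tokenizer generated for the message.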
>
>>> This is btw one of the biggest reasons why one should avoid TEFT.
>>> DSPAM is all about learning. Right? So imagine that, as a kid, you had
>>> a teacher who would FORCE you to read and learn the whole of algebra
>>> theory before you answered any mathematical question. Can you imagine
>>> this? A teacher asking:
>>> Teacher: Tom? Can you say what x is in this equation: 2x + 3 = 13
>>> Tom: The answer is ... [teacher interrupts you here]
>>> Teacher: For now I don't care about the result. Tom, I want you first
>>> to read this 400-page mathematical book about algebra, and afterwards
>>> I want to hear the result.
>>> Tom: [sits down for 8 hours reading that 400-page book]
>>> [fast forward 8 hours]
>>> Tom: Teacher? The answer is 100.
>>> Teacher: WRONG! Tom! You are wrong. Now go back and read that 400-page
>>> book again.
>>> Tom: [sits down for another 8 hours reading that 400-page book]
>>> [fast forward another 8 hours]
>>> Tom: Teacher? I have read that book.
>>> Teacher: Okay Tom. Now that you have LEARNED that 400-page algebra
>>> book, I will give you the right answer.
>>> Teacher: The correct answer is x = 5.
>>
>>> Now, the above example might sound reasonable to you, since you made
>>> an error in computing x. But imagine what happens if you had given the
>>> correct answer from the beginning? Right: your teacher would not force
>>> you to read the whole algebra book a second time. He would, however,
>>> still FORCE you to read the 400-page book EACH time before/after
>>> (depending on your viewpoint) you give an answer.
>>
>>> TOE, for example, would be more logical (logical in the sense that it
>>> behaves much like humans do).
>>
>>> TOE wrong answer:
>>> Teacher: Question
>>> TOE: Wrong answer
>>> Teacher: TOE! Learn/correct now!
>>
>>> TOE right answer:
>>> Teacher: Question
>>> TOE: Correct answer
>>> Teacher: Good boy. Continue with next question.
>>
>>
>>> TEFT on the other hand is different:
>>
>>> TEFT wrong answer:
>>> Teacher: Question
>>> TEFT: Need first to read/learn 400 pages of Algebra.
>>> TEFT: Wrong answer
>>> Teacher: TEFT! Learn/correct now!
>>
>>> TEFT right answer:
>>> Teacher: Question
>>> TEFT: Need first to read/learn 400 pages of Algebra.
>>> TEFT: Correct answer
>>> Teacher: Good boy. Continue with next question.
>>
>>
>>> TEFT was a good way in the old days to speed up dull tokenizers like
>>> WORD and CHAIN. But I see you use OSB, and there I would strongly
>>> suggest not using TEFT. Wipe all your data and start fresh with TOE. I
>>> am pretty sure that you will not need many corrections to reach the
>>> same accuracy level that you have now with TEFT.
>>
>
> That's a long and clear explanation, thank you :)
> I always used TEFT up until now, but it is a small scale setup (I'm the
> main user). The '8-hour learning time' (computation time) is no
> bottleneck here,
>
Aggrrr... my explanation was not good enough. It's not only about the
8-hour learning time. It is about HOW the learning is done. How can I
explain it differently? Imagine something as simple as learning to compute
2 + 2. Now each time I ask you how much 2+2 is... you don't just give me
the answer, you also learn again that 2+2 is 4. That would be TEFT. TOE
will NOT learn on correct answers. So if I ask you how much 2+2 is and you
give me the answer 4, then I am not going to tell you that you need to
learn again that 2+2=4. All I will do is say that the answer is correct
and ask you the next question... how much is 2*4? And if you give me the
answer 8 then I am happy and will continue with the next question: how
much is 2*3? And if you now answer 9 then I will say: Tom, that is wrong.
2*3 is 6.

This is how TOE works.

TEFT on the other hand keeps pushing stuff into your mind that you have
already answered correctly and that you already know. So what is the point
in learning again that 2+2 is 4 if you already know that 2+2 is 4?
Obviously you will tell me that there is no point in learning that again.
Right? But with TEFT you are doing EXACTLY this. You learn and relearn,
again and again, the same old stuff that you already learned and that you
already know.

This is just one of the bad parts of TEFT.

The other is that the automatic learning nature of TEFT decreases accuracy
IF the DSPAM end user is NOT correcting errors. Each time DSPAM answers
the question "Spam/Ham?", it not only gives an answer but also learns its
OWN GIVEN ANSWER. So not correcting errors with TEFT is an accelerator for
more and more inaccurate results.

TOE does not have this negative side effect. Of course TOE accuracy
decreases too if you don't correct errors, but much more slowly than with
TEFT. TOE's decrease in accuracy is mainly because of the SPAM/HAM total
processed counters, while in TEFT it is because of the SPAM/HAM total
processed counters AND because of the changed {spam|innocent}_hits on the
tokens.
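If it helps, here is the whole difference boiled down to a toy sketch in
Python (not DSPAM code, just the idea; the token database is a plain dict
and the real statistics are left out):

from collections import defaultdict

db = defaultdict(lambda: {"spam_hits": 0, "innocent_hits": 0})

def train(tokens, as_spam):
    # bump the per-token counters (real DSPAM also bumps the per-user
    # total processed counters)
    key = "spam_hits" if as_spam else "innocent_hits"
    for t in tokens:
        db[t][key] += 1

def classify(tokens):
    # placeholder for the real p-value math: just compare raw counts
    spam = sum(db[t]["spam_hits"] for t in tokens)
    innocent = sum(db[t]["innocent_hits"] for t in tokens)
    return spam > innocent

def process_teft(tokens):
    is_spam = classify(tokens)
    train(tokens, as_spam=is_spam)   # TEFT: always relearns its own verdict
    return is_spam

def process_toe(tokens):
    return classify(tokens)          # TOE: classify only, no automatic training

def user_corrects(tokens, correct_is_spam):
    # both modes retrain when the user reports a misclassification
    train(tokens, as_spam=correct_is_spam)

You can see the problem directly in process_teft(): if nobody corrects
errors, every wrong verdict is written straight back into the token
counters and reinforces itself, while process_toe() leaves the database
alone until a human steps in.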


> but I'll see what happens if I start fresh with OSB+TOE.
>
>>
>> Most spammy (for a spam-classified message) or most innocent (for an
>> innocent-classified message) tokens?
>>
>>> Yes.
>>
>>
>> Hmm maybe I should read more about these algorithms... :)
>>
>>> The article/wiki entry from Julien should help you understand the
>>> formulas.
>>
>
> That would be a nice way to spend a warm summer evening sometime this
> week :)
>
> --
> Regards,
>
>       Tom
>



