Re: [Dspam-user] Understanding classification: dspam factors?

Tom Hendrikx Tue, 26 Apr 2011 04:54:10 -0700

On 26/04/11 11:46, ste...@bajic.ch wrote:
> On 23/04/11 12:24, Stevan Bajić wrote:
>>>> On Fri, 22 Apr 2011 10:17:17 +0200
>>>> Tom Hendrikx <t...@whyscream.net> wrote:
>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> Hi,
>>>>>
>>>> Hello Tom,
>>>>
>>>>
>>>>> In my current setup I just received my first FP.
>>>>>
>>>> I hope it is an old setup? One FP is not a big thing.
>>>>
>>>>> Dspam is set up to add
>>>>> the dspam-factors header to classified e-mails, but after reviewing the
>>>>> data, I don't understand why dspam decided to classify the message as
>>>>> spam. Also, the X-DSPAM-Improbability header has weird contents.
>>>>>
>>>>> Does the dspam_factors header contain all of the tokens used to
>>>>> classify
>>>>> the message, or only a subset of them?
>>>>>
>>>> All of them. But if you use more than one algorithm then only the first
>>>> one will be shown in X-DSPAM-Factors.
>>>>
> 
> Ah I see. I use "graham burton" now. Would there be any change in
> classification if I changed that to "burton graham" in order to see more
> factors?
> 
>> Yes. Try it. I don't think you can lose much by trying.
> 
> 
>>>>> Because the headers in the FP
>>>>> message do not explain why it happens:
>>>>>
>>>>> X-DSPAM-Result: Spam
>>>>> X-DSPAM-Processed: Fri Apr 22 01:01:29 2011
>>>>> X-DSPAM-Confidence: 0.9963
>>>>> X-DSPAM-Improbability: 1 in 26939 chance of being ham
>>>>> X-DSPAM-Probability: 1.0000
>>>>> X-DSPAM-Signature: 1,4db0b74991741873512032
>>>>> X-DSPAM-Factors: 15,
>>>>>   X-AntiAbuse*Original+#+-, 0.99649,
>>>>>   X-AntiAbuse*Caller+#+GID, 0.99649,
>>>>>   X-AntiAbuse*Sender+#+Domain, 0.99649,
>>>>>   X-AntiAbuse*please+#+it, 0.99649,
>>>>>   X-AntiAbuse*with+#+#+report, 0.99649,
>>>>>   X-AntiAbuse*to+#+abuse, 0.99649,
>>>>>   X-AntiAbuse*Primary+#+-, 0.99649,
>>>>>   X-AntiAbuse*Original+Domain, 0.99649,
>>>>>   X-AntiAbuse*GID+-, 0.99649,
>>>>>   X-AntiAbuse*Sender+#+#+-, 0.99649,
>>>>>   X-AntiAbuse*track+abuse, 0.99649,
>>>>>   X-AntiAbuse*header+was, 0.99649,
>>>>>   X-AntiAbuse*header+#+#+#+track, 0.99649,
>>>>>   X-AntiAbuse*was+#+to, 0.99649,
>>>>>   X-AntiAbuse*Originator+Caller, 0.99649
>>>>>
>>>>> According to the scoring of the listed tokens, I think this message
>>>>> should be marked as ham, not as spam.
>>>>>
>>>> I think you are mixing things up here.
> 
> First thing I mixed up was that I was under the impression that a high
> 'score' in the token meant a 'low spamminess'. My bad, as it's the other
> way around.
> 
>> Are you sure it is the other way around? Maybe I am having a memory lapse
>> right now, but as far as I remember the score behind the tokens is related
>> to 'X-DSPAM-Result'. So the result influences what the scoring means. If
>> the result is 'Innocent' then a high score means a high innocent rating.
>


As far as I know, only the dspam-confidence header can be near 1.0 for
both spam and innocent. Reviewing some dspam-factors in whitelisted
and innocent mails shows 0.0100 scores for most tokens.
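
For comparing factor lists across messages, a minimal sketch of a parser
for the header value may help. It assumes the layout seen in the example
above (a leading count, then alternating token and score separated by
commas) and that tokens themselves contain no commas. `parse_factors` is a
hypothetical helper, not part of DSPAM.

```python
def parse_factors(header_value):
    """Split an X-DSPAM-Factors value into (count, [(token, score), ...]).

    Assumes: first field is the token count, remaining fields alternate
    token, score, and tokens contain no commas (true for the sample above,
    not guaranteed in general).
    """
    # Folded header lines leave newlines and indentation around fields.
    parts = [p.strip() for p in header_value.split(",")]
    parts = [p for p in parts if p]

    count = int(parts[0])
    rest = parts[1:]

    # Pair up token/score fields in order.
    factors = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return count, factors


if __name__ == "__main__":
    sample = """3,
      X-AntiAbuse*track+abuse, 0.99649,
      X-AntiAbuse*header+was, 0.99649,
      X-AntiAbuse*was+#+to, 0.99649"""
    count, factors = parse_factors(sample)
    for token, score in factors:
        print(f"{score:.5f}  {token}")
```

Sorting the resulting pairs by score makes it easy to check whether a
message's listed tokens sit near 0.01 or near 0.99.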

> 
>>>> If the result is "Spam" then the shown tokens are spam tokens. See this
>>>> old CHANGELOG entry:
>>>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>>> [20040819.0800] jonz: added X-DSPAM-Factors
>>>>
>>>> added determining factors header to emails containing a list of tokens
>>>> that
>>>> played a role in the decision. if multiple algorithms are defined, only
>>>> one
>>>> is used. if the message is spam, the factor set from an algorithm
>>>> returning
>>>> a spam result will be used.
>>>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>>>
> 
> In debug output, there are many more tokens generated than 15 or 27 (I
> count 1350 tokens in my test).
> 
>> Right. This is the tokenizer doing its work.
> 
> 
> How are the 15 or 27 tokens that are used
> for classification selected from the larger set?
> 
>> All the generated tokens are sorted and then the most significant tokens
>> are used. Significant in this case means tokens with either a high
>> innocent or spam count. Graham takes only the 15 most significant tokens
>> and Burton takes the 27 most significant tokens. Burton does not eliminate
>> duplicate entries, while Graham does (Graham uses the 15 most significant
>> tokens that are unique within that message).

okay that's clear :)
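
A minimal sketch of the selection as described above. Only the window
sizes (15 and 27) and the unique/non-unique distinction come from the
explanation; measuring "significance" as the distance of a token's
probability from 0.5 is an assumption made for illustration, and
`select_tokens` is a hypothetical name, not DSPAM code.

```python
def select_tokens(tokens, limit, unique):
    """Pick the `limit` most significant (name, probability) pairs.

    Assumption: significance = |probability - 0.5|, so tokens close to
    0.0 (innocent) or 1.0 (spam) rank first.
    """
    ranked = sorted(tokens, key=lambda t: abs(t[1] - 0.5), reverse=True)

    if unique:
        # Keep only the first occurrence of each token name.
        seen = set()
        deduped = []
        for name, prob in ranked:
            if name not in seen:
                seen.add(name)
                deduped.append((name, prob))
        ranked = deduped

    return ranked[:limit]


def graham_window(tokens):
    # Graham: 15 most significant tokens, duplicates removed.
    return select_tokens(tokens, limit=15, unique=True)


def burton_window(tokens):
    # Burton: 27 most significant tokens, duplicates kept.
    return select_tokens(tokens, limit=27, unique=False)
```

With a message that repeats one strongly spammy token many times, the
Burton window can fill up with copies of that token, while the Graham
window counts it once.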

>> This is btw one of the biggest reasons
>> why one should avoid TEFT. DSPAM is all about learning. Right? So imagine
>> that as a kid you had a teacher who would FORCE you to read and learn
>> the whole Algebra theory before you could answer any mathematical
>> question. Can you imagine this? A teacher asking:
>> Teacher: Tom? Can you say what x is in this equation: 2x + 3 = 13
>> Tom: The answer is ... [teacher interrupts you here]
>> Teacher: For now I don't care about the result. Tom, I want you first to
>> read this 400-page mathematics book about Algebra, and afterwards I want
>> to hear the result.
>> Tom: Sits down for 8 hours reading that 400-page book.
>> [fast forward 8 hours]
>> Tom: Teacher? The answer is 100.
>> Teacher: WRONG! Tom! You are wrong. Now go back and read that 400-page
>> book again.
>> Tom: Sits down for 8 hours reading that 400-page book.
>> [fast forward another 8 hours]
>> Tom: Teacher? I have read that book.
>> Teacher: Okay Tom. Now that you have LEARNED that 400-page Algebra book I
>> will give you the right answer.
>> Teacher: The correct answer is x = 5.
> 
>> Now, the above example might sound logical to you since you made
>> an error in computing x. But imagine what happens if you had given the
>> correct answer from the beginning. Right. Your teacher would not force you
>> to read the whole Algebra book a second time. But he would still FORCE you
>> to read the 400-page book EACH time before/after (depending on your
>> viewpoint) you give the answer.
> 
>> TOE for example would be more logical (logical in the sense that it
>> behaves much like humans do).
> 
>> TOE wrong answer:
>> Teacher: Question
>> TOE: Wrong answer
>> Teacher: TOE! Learn/correct now!
> 
>> TOE right answer:
>> Teacher: Question
>> TOE: Correct answer
>> Teacher: Good boy. Continue with next question.
> 
> 
>> TEFT on the other hand is different:
> 
>> TEFT wrong answer:
>> Teacher: Question
>> TEFT: Need first to read/learn 400 pages of Algebra.
>> TEFT: Wrong answer
>> Teacher: TEFT! Learn/correct now!
> 
>> TEFT right answer:
>> Teacher: Question
>> TEFT: Need first to read/learn 400 pages of Algebra.
>> TEFT: Correct answer
>> Teacher: Good boy. Continue with next question.
> 
> 
>> TEFT was a good way in the old days to speed up dull tokenizers like WORD
>> and CHAIN. But I see you use OSB and there I would strongly suggest not
>> using TEFT. Wipe all your data and start fresh with TOE. I am pretty
>> sure that you will not need many corrections to reach the same accuracy
>> level that you have now with TEFT.
> 

That's a long and clear explanation, thank you :)
I always used TEFT up until now, but it is a small scale setup (I'm the
main user). The '8-hour learning time' (computation time) is no
bottleneck here, but I'll see what happens if I start fresh with OSB+TOE.
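
The difference between the two modes, as explained above, can be sketched
with a toy filter. Only the training policy is taken from the thread: TEFT
learns from every message it classifies, TOE learns only when it is
corrected. Everything else is hypothetical: `ToyFilter`, its crude
count-difference scoring, and the "undo then relearn" step on a TEFT
correction are assumptions for illustration, not DSPAM's implementation.

```python
from collections import Counter


class ToyFilter:
    """Toy classifier that counts how often it writes to its token store."""

    def __init__(self, mode):
        if mode not in ("toe", "teft"):
            raise ValueError("mode must be 'toe' or 'teft'")
        self.mode = mode
        self.spam = Counter()
        self.ham = Counter()
        self.writes = 0  # number of training passes over a message

    def _learn(self, tokens, label, sign=1):
        table = self.spam if label == "spam" else self.ham
        for token in tokens:
            table[token] += sign
        self.writes += 1

    def classify(self, tokens):
        # Crude stand-in for a real probability calculation.
        score = sum(self.spam[t] - self.ham[t] for t in tokens)
        return "spam" if score > 0 else "ham"

    def process(self, tokens, truth):
        guess = self.classify(tokens)
        if self.mode == "teft":
            # TEFT: train on everything, using the filter's own verdict.
            self._learn(tokens, guess)
        if guess != truth:
            if self.mode == "teft":
                # Assumed correction step: undo the wrong training first.
                self._learn(tokens, guess, sign=-1)
            # Both modes learn the corrected label on an error.
            self._learn(tokens, truth)
        return guess


if __name__ == "__main__":
    stream = [
        (["buy", "pills"], "spam"),
        (["buy", "pills"], "spam"),
        (["lunch", "today"], "ham"),
    ]
    for mode in ("toe", "teft"):
        f = ToyFilter(mode)
        results = [f.process(tokens, truth) for tokens, truth in stream]
        print(mode, results, "training passes:", f.writes)
```

On a small setup the extra training passes cost little time, but they do
mean the token store is written on every message rather than only on
mistakes.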

> 
> Most spammy (for
> spam-classified message) or innocent (for innocent-classified message)
> tokens?
> 
>> Yes.
> 
> 
> Hmm maybe I should read more about these algorithms... :)
> 
>> The article/wiki entry from Julien should help you understand the
>> formulas.
> 

That would be a nice way to spend a warm summer evening sometime this
week :)

--
Regards,

        Tom

