>
> On 23/04/11 12:24, Stevan Bajić wrote:
>> On Fri, 22 Apr 2011 10:17:17 +0200
>> Tom Hendrikx <t...@whyscream.net> wrote:
>>
>>>
>>> Hi,
>>>
>> Hello Tom,
>>
>>
>>> In my current setup I just received my first FP.
>>>
>> I hope it is an old setup? One FP is not a big thing.
>>
>>> Dspam is set up to add
>>> the dspam-factors header to classified e-mails, but after reviewing the
>>> data, I don't understand why dspam decided to classify the message as
>>> spam. Also the X-DSPAM-Improbability header has weird contents.
>>>
>>> Does the dspam_factors header contain all of the tokens used to
>>> classify
>>> the message, or only a subset of them?
>>>
>> All of them. But if you use more than one algorithm then only the first
>> one will be shown in X-DSPAM-Factors.
>>
>
> Ah I see. I use "graham burton" now. Would there be any change in
> classification if I changed that to "burton graham" in order to see more
> factors?
>
Yes. Try it. I don't think you can lose much by trying.
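
If you want Burton's 27 factors in the header rather than Graham's 15, swapping
the order in dspam.conf should do it (assuming the factors really do follow the
first listed algorithm, as described above):

    Algorithm burton graham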


>>> Because the headers in the FP
>>> message do not explain why it happens:
>>>
>>> X-DSPAM-Result: Spam
>>> X-DSPAM-Processed: Fri Apr 22 01:01:29 2011
>>> X-DSPAM-Confidence: 0.9963
>>> X-DSPAM-Improbability: 1 in 26939 chance of being ham
>>> X-DSPAM-Probability: 1.0000
>>> X-DSPAM-Signature: 1,4db0b74991741873512032
>>> X-DSPAM-Factors: 15,
>>>     X-AntiAbuse*Original+#+-, 0.99649,
>>>     X-AntiAbuse*Caller+#+GID, 0.99649,
>>>     X-AntiAbuse*Sender+#+Domain, 0.99649,
>>>     X-AntiAbuse*please+#+it, 0.99649,
>>>     X-AntiAbuse*with+#+#+report, 0.99649,
>>>     X-AntiAbuse*to+#+abuse, 0.99649,
>>>     X-AntiAbuse*Primary+#+-, 0.99649,
>>>     X-AntiAbuse*Original+Domain, 0.99649,
>>>     X-AntiAbuse*GID+-, 0.99649,
>>>     X-AntiAbuse*Sender+#+#+-, 0.99649,
>>>     X-AntiAbuse*track+abuse, 0.99649,
>>>     X-AntiAbuse*header+was, 0.99649,
>>>     X-AntiAbuse*header+#+#+#+track, 0.99649,
>>>     X-AntiAbuse*was+#+to, 0.99649,
>>>     X-AntiAbuse*Originator+Caller, 0.99649
>>>
>>> According to the scoring of the listed tokens, I think this message
>>> should be marked as ham, not as spam.
>>>
>> I think you mix up things here.
>
> First thing I mixed up was that I was under the impression that a high
> 'score' in the token meant a 'low spamminess'. My bad, as it's the other
> way around.
>
Are you sure it is the other way around? Maybe my memory is playing tricks on
me right now, but as far as I remember the score behind the tokens is related
to 'X-DSPAM-Result'. So the result influences what the scoring means. If the
result is 'Innocent', then a high score means a high innocent rating.
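
To make that concrete with the FP headers above: there the result is 'Spam',
so the 0.99649 next to each X-AntiAbuse token reads as a spam rating for that
token, not an innocent one. The same value under an 'X-DSPAM-Result: Innocent'
header would read as a strongly innocent token.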


>> If the result is "Spam" then the shown tokens are spam tokens. See this
>> old CHANGELOG entry:
>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>> [20040819.0800] jonz: added X-DSPAM-Factors
>>
>> added determining factors header to emails containing a list of tokens
>> that
>> played a role in the decision. if multiple algorithms are defined, only
>> one
>> is used. if the message is spam, the factor set from an algorithm
>> returning
>> a spam result will be used.
>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>
>
> In debug output, there are many more tokens generated than 15 or 27 (I
> count 1350 tokens in my test).
>
Right. This is the tokenizer doing its work.
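
For a rough idea of where those ~1350 tokens come from, here is a small Python
sketch of OSB-style pair generation. This is only an approximation of the idea,
not DSPAM's actual C code, and real DSPAM additionally prefixes tokens from
headers with the header name, which is why you see things like
X-AntiAbuse*please+#+it in the factors:

    def osb_tokens(words, window=5):
        # Pair each word with each of the next (window - 1) words,
        # marking skipped positions with '#'.
        tokens = []
        for i, left in enumerate(words):
            for dist in range(1, window):
                if i + dist >= len(words):
                    break
                tokens.append(left + "+" + "#+" * (dist - 1) + words[i + dist])
        return tokens

    print(osb_tokens("please report it".split()))
    # ['please+report', 'please+#+it', 'report+it']

Every word spawns up to four such pairs, so a normal-sized mail plus its
headers easily ends up with more than a thousand tokens.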


> How are the 15 or 27 tokens that are used
> for classification selected from the larger set?
>
All the generated tokens are sorted and then the most significant ones are
used, 'significant' in this case meaning tokens with either a high innocent or
a high spam count. Graham takes only the 15 most significant tokens and Burton
takes the 27 most significant tokens. Burton does not eliminate duplicate
entries, while Graham does (Graham uses the 15 most significant unique tokens
for that message); see the sketch below.
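
Here is a hedged Python sketch of that selection step; it is only an
approximation of the idea (I'm assuming 'most significant' means the token
probability farthest from the neutral 0.5), not DSPAM's exact code:

    def significant_tokens(token_probs, n, unique=True):
        # token_probs: one (token, p_spam) entry per token occurrence
        # in the message, as produced by the tokenizer.
        # Rank by how far the probability is from neutral, i.e. by how
        # strongly innocent or spammy the token is.
        ranked = sorted(token_probs, key=lambda tp: abs(tp[1] - 0.5), reverse=True)
        if not unique:
            return ranked[:n]            # Burton: duplicates allowed
        seen, picked = set(), []
        for token, p in ranked:          # Graham: unique tokens only
            if token not in seen:
                seen.add(token)
                picked.append((token, p))
            if len(picked) == n:
                break
        return picked

    # graham_factors = significant_tokens(all_tokens, 15, unique=True)
    # burton_factors = significant_tokens(all_tokens, 27, unique=False)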
This is, by the way, one of the biggest reasons why one should avoid TEFT.
DSPAM is all about learning, right? So imagine that as a kid you had a teacher
who would FORCE you to read and learn the whole of algebra theory before you
answered any mathematical question. Could you imagine this? A teacher asking:
Teacher: Tom? Can you say what x is in this equation: 2x + 3 = 13
Tom: The answer is ... [teacher interrupts you here]
Teacher: For now I don't care about the result. I want you, Tom, to first
read this 400-page mathematics book about algebra, and afterwards I want to
hear the result.
Tom: [sits down for 8 hours and reads that 400-page book]
[fast forward 8 hours]
Tom: Teacher? The answer is 100.
Teacher: WRONG! Tom! You are wrong. Now go back and read that 400-page book
again.
Tom: [sits down for another 8 hours and reads that 400-page book]
[fast forward another 8 hours]
Tom: Teacher? I have read that book.
Teacher: Okay, Tom. Now that you have LEARNED that 400-page algebra book, I
will give you the right answer.
Teacher: The correct answer is x = 5.

Now, the above example might sound logical to you, since you made an error in
computing x. But imagine what happens if you had given the correct answer from
the beginning. Right: your teacher would not force you to read the whole
algebra book a second time. He would, however, still FORCE you to read the
400-page book EACH time before/after (depending on your viewpoint) you give
the answer.

TOE, for example, would be more logical (logical in the sense that it behaves
much like humans do).

TOE wrong answer:
Teacher: Question
TOE: Wrong answer
Teacher: TOE! Learn/correct now!

TOE right answer:
Teacher: Question
TOE: Correct answer
Teacher: Good boy. Continue with next question.


TEFT on the other hand is different:

TEFT wrong answer:
Teacher: Question
TEFT: Need first to read/learn 400 pages of Algebra.
TEFT: Wrong answer
Teacher: TEFT! Learn/correct now!

TEFT right answer:
Teacher: Question
TEFT: Need first to read/learn 400 pages of Algebra.
TEFT: Correct answer
Teacher: Good boy. Continue with next question.


TEFT was a good way in the old days to speed up dull tokenizers like WORD and
CHAIN. But I see you use OSB, and there I would strongly suggest not using
TEFT. Wipe all your data and start fresh with TOE. I am pretty sure that you
will not need many corrections to get to the same accuracy level that you are
at now with TEFT.
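
In dspam.conf this is a one-line change (plus resetting your token data,
however you store it) while keeping everything else as you posted it:

    TrainingMode toe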


> Most spammy (for
> spam-classified message) or innocent (for innocent-classified message)
> tokens?
>
Yes.


> Hmm maybe I should read more about these algorithms... :)
>
The article/wiki entry from Julien should help you understand the formulas.
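
In short (check Julien's write-up for DSPAM's exact variant), Graham combines
the 15 factor probabilities p1..p15 with the usual naive-Bayes chain rule:

    P(spam) = (p1 * p2 * ... * p15)
              / ((p1 * p2 * ... * p15) + ((1-p1) * (1-p2) * ... * (1-p15)))

and Burton does the same over its 27 (not necessarily unique) factors. With
fifteen values around 0.99649, as in your FP, that ratio is essentially 1.0,
which matches the X-DSPAM-Probability: 1.0000 you saw.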


>
>>
>>> Relevant values from dspam.conf:
>>
>>> TrainingMode teft
>>> ImprobabilityDrive on
>>> Algorithm graham burton
>>>
>> The 15 factors you see in your mail are the ones from Graham. Burton
>> would produce 27.
>>
>>
>>> Tokenizer osb
>>> PValue bcr
>>>
>>> All of the above with a git tip checkout from 2011-03-01.
>>>
>>> Kind regards,
>>>
>>>     Tom
>>>
>
>


