Re: Bayes autolearn questions

Karsten Bräckelmann Tue, 02 Sep 2014 19:22:16 -0700

On Tue, 2014-09-02 at 21:11 -0400, Alex wrote:
> I have a spamassassin-3.4 system with the following bayes config:
> 
> required_hits 5.0
> rbl_timeout 8
> use_bayes 1
> bayes_auto_learn 1
> bayes_auto_learn_on_error 1
> bayes_auto_learn_threshold_spam 9.0
> bayes_expiry_max_db_size 9500000
> bayes_auto_expire 0
> 
> However, spam with scores greater than 9.0 aren't being autolearned:


http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html


> Sep  2 21:01:51 mail01 amavis[25938]: (25938-10)
> header_edits_for_quar: <bmu011...@bmu-011.hichina.com> ->
> <bestd...@example.com>, Yes, score=16.519 tag=-200 tag2=5 kill=5
> tests=[BAYES_50=0.8, KAM_LAZY_DOMAIN_SECURITY=1, KAM_LINKBAIT=5,
> LOC_DOT_SUBJ=0.1, LOC_SHORT=3.1, RCVD_IN_BL_SPAMCOP_NET=1.347,
> RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.3,
> RCVD_IN_UCEPROTECT1=0.01, RCVD_IN_UCEPROTECT2=0.01, RDNS_NONE=0.793,
> RELAYCOUNTRY_CN=0.1, RELAYCOUNTRY_HIGH=0.5, SAGREY=0.01] autolearn=no
> autolearn_force=no
> 
> I've re-read the autolearn section of the docs,

The one I linked to above?

> and don't see any reason why this 16-point email wouldn't have any new
> tokens to be learned?

Rules with certain tflags are ignored when determining whether a message
should be trained upon. Most notably here BAYES_xx.

Moreover, the auto-learning decision occurs using scores from either
scoreset 0 or 1, that is using scores of a non-Bayes scoreset. IOW the
message's score of 16 is irrelevant, since the auto-learn algorithm uses
different scores per rule.

Next safety net is requiring at least 3 points each from header and body
rules, unless autolearn_force is enabled. Which it is not in your
sample.

Either of those could have prevented auto-learning.
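
To make the three checks above concrete, here is a rough sketch in Python.
It is not the plugin's code. The Hit class, the function name and the
example scores are all made up for illustration. The ignored tflags and
the 0.1 non-spam default are as I recall them from the plugin docs, so
check them against the page linked above. The per-rule scores are meant
to be those of the non-Bayes scoreset, not the ones in your log line.

```python
from dataclasses import dataclass

# Assumption: tflags the plugin docs list as ignored for auto-learning.
IGNORED_TFLAGS = {"learn", "userconf", "noautolearn"}

@dataclass
class Hit:
    name: str
    score: float   # score from the non-Bayes scoreset (0 or 1)
    kind: str      # "header" or "body"
    tflags: frozenset = frozenset()

def autolearn_decision(hits, spam_threshold=9.0, nonspam_threshold=0.1,
                       force=False):
    """Return "spam", "ham" or "no" for a list of rule hits."""
    # 1. Drop rules whose tflags exclude them (BAYES_xx has tflags learn).
    counted = [h for h in hits if not (h.tflags & IGNORED_TFLAGS)]
    total = sum(h.score for h in counted)
    header = sum(h.score for h in counted if h.kind == "header")
    body = sum(h.score for h in counted if h.kind == "body")

    # 2. Compare the recomputed total against the two thresholds.
    if total < nonspam_threshold:
        return "ham"
    if total < spam_threshold:
        return "no"

    # 3. Safety net: 3 points each from header and body rules,
    #    unless autolearn_force applies.
    if not force and (header < 3 or body < 3):
        return "no"
    return "spam"

# Hypothetical hits; the rule names and scores are invented.
example_hits = [
    Hit("BAYES_50", 0.8, "body", frozenset({"learn"})),
    Hit("HYPO_HEADER_RULE", 2.5, "header"),
    Hit("HYPO_BODY_RULE", 8.0, "body"),
]

if __name__ == "__main__":
    print(autolearn_decision(example_hits))
    print(autolearn_decision(example_hits, force=True))
```

In this made-up case the recomputed total is above 9.0, yet the message
is not learned, because the header rules contribute fewer than 3 points.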


Also, according to your wording, you seem to think in terms of (number
of) "new tokens to be learned". Which has nothing in common with
auto-learning.

(Even worse, "new tokens" would strongly apply to random gibberish
strings, hapaxes in Bayes context. Which are commonly ignored in Bayes
classification.)


> I looked in the quarantined message, and according to the _TOKEN_
> header I've added:
> 
> X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
> 
> Isn't that sufficient for auto-learning this message as spam?

That has absolutely nothing to do with auto-learning. Where did you get
the impression it might?


> I just wanted to be sure this is just a case of not enough new points
> (tokens?) for the message to be learned, and that I wasn't doing
> something wrong.

Points: a.k.a. score. Used both per rule (per test) and for the overall
score, which classifies a message against the required_score setting.

Token: think of it as "word" used by the Bayesian classifier sub-system.
In practice, it is more complicated than simply space separated words.
Context (e.g. headers) and case might be taken into account, too.
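
A toy illustration of "more than space separated words", again in
Python. This is not SpamAssassin's tokenizer. The function and the
"H<header>:" prefix are assumptions made up for the example, only meant
to show how case and header context can yield distinct tokens.

```python
import string

def tokenize(text, header=None):
    """Split text into tokens, keeping case and tagging header context."""
    tokens = []
    for word in text.split():
        # Trim surrounding punctuation, but keep the original case.
        word = word.strip(string.punctuation)
        if not word:
            continue
        # Hypothetical scheme: a word seen in a header becomes a
        # different token than the same word seen in the body.
        tokens.append(f"H{header}:{word}" if header else word)
    return tokens

if __name__ == "__main__":
    print(tokenize("Free Money now!"))
    print(tokenize("Free Money", header="Subject"))
```

With this scheme "Free" in the body, "free" in the body and "Free" in
the Subject header are three separate tokens.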


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
