On Tue, 2014-09-02 at 21:11 -0400, Alex wrote:
> I have a spamassassin-3.4 system with the following bayes config:
>
> required_hits 5.0
> rbl_timeout 8
> use_bayes 1
> bayes_auto_learn 1
> bayes_auto_learn_on_error 1
> bayes_auto_learn_threshold_spam 9.0
> bayes_expiry_max_db_size 9500000
> bayes_auto_expire 0
>
> However, spam with scores greater than 9.0 aren't being autolearned:

http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

> Sep 2 21:01:51 mail01 amavis[25938]: (25938-10)
> header_edits_for_quar: <bmu011...@bmu-011.hichina.com> ->
> <bestd...@example.com>, Yes, score=16.519 tag=-200 tag2=5 kill=5
> tests=[BAYES_50=0.8, KAM_LAZY_DOMAIN_SECURITY=1, KAM_LINKBAIT=5,
> LOC_DOT_SUBJ=0.1, LOC_SHORT=3.1, RCVD_IN_BL_SPAMCOP_NET=1.347,
> RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.3,
> RCVD_IN_UCEPROTECT1=0.01, RCVD_IN_UCEPROTECT2=0.01, RDNS_NONE=0.793,
> RELAYCOUNTRY_CN=0.1, RELAYCOUNTRY_HIGH=0.5, SAGREY=0.01] autolearn=no
> autolearn_force=no
>
> I've re-read the autolearn section of the docs,

The one I linked to above?

> and don't see any reason why this 16-point email wouldn't have any new
> tokens to be learned?

Rules with certain tflags are ignored when determining whether a message
should be trained upon. Most notably here BAYES_xx.

Moreover, the auto-learning decision occurs using scores from either
scoreset 0 or 1, that is using scores of a non-Bayes scoreset. IOW the
message's score of 16 is irrelevant, since the auto-learn algorithm uses
different scores per rule.

Next safety net is requiring at least 3 points each from header and body
rules, unless autolearn_force is enabled. Which it is not in your sample.

Either of those could have prevented auto-learning.

Also, according to your wording, you seem to think in terms of (number
of) "new tokens to be learned". Which has nothing in common with
auto-learning. (Even worse, "new tokens" would strongly apply to random
gibberish strings, hapaxes in Bayes context. Which are commonly ignored
in Bayes classification.)

> I looked in the quarantined message, and according to the _TOKEN_
> header I've added:
>
> X-Spam-MyReport: Tokens: new, 47; hammy, 7; neutral, 54; spammy, 16.
>
> Isn't that sufficient for auto-learning this message as spam?

That has absolutely nothing to do with auto-learning. Where did you get
the impression it might?
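
To make the auto-learn checks described above concrete, here is a rough
Python sketch of the decision as a spam candidate. It is not the actual
SpamAssassin code (which is Perl), only an illustration of the three
points made in this mail: rules with certain tflags are skipped, the
scores come from a non-Bayes scoreset, and at least 3 points each are
needed from header and body rules unless autolearn_force applies. The
RuleHit type, the rule names, the scores and the exact set of ignored
tflags are assumptions made up for the example.

```python
from dataclasses import dataclass

# Assumption: tflags that exclude a rule from the auto-learn score.
# Check the plugin documentation for the authoritative list.
IGNORED_TFLAGS = {"learn", "noautolearn", "userconf"}


@dataclass
class RuleHit:
    name: str
    score: float                      # score in scoreset 0 or 1 (non-Bayes), hypothetical
    kind: str                         # "header", "body" or "other"
    tflags: frozenset = frozenset()   # tflags set on the rule, if any


def autolearn_as_spam(hits, threshold_spam=9.0, min_part_points=3.0, force=False):
    """Return True if the message would qualify for auto-learning as spam."""
    # Skip rules such as BAYES_xx that carry an ignored tflag.
    counted = [h for h in hits if not (h.tflags & IGNORED_TFLAGS)]

    total = sum(h.score for h in counted)
    header_points = sum(h.score for h in counted if h.kind == "header")
    body_points = sum(h.score for h in counted if h.kind == "body")

    # The recomputed total, not the reported message score, is compared
    # against bayes_auto_learn_threshold_spam.
    if total < threshold_spam:
        return False

    # Safety net: enough points from header rules and from body rules,
    # unless autolearn_force is in effect.
    if not force and (header_points < min_part_points or body_points < min_part_points):
        return False

    return True


# Hypothetical hits: the total passes 9.0, but the body rules contribute too little.
sample = [
    RuleHit("BAYES_50", 0.8, "body", frozenset({"learn"})),
    RuleHit("HYPOTHETICAL_HEADER_RULE", 7.0, "header"),
    RuleHit("HYPOTHETICAL_BODY_RULE", 2.5, "body"),
]

if __name__ == "__main__":
    print(autolearn_as_spam(sample))              # False: body points below 3
    print(autolearn_as_spam(sample, force=True))  # True: safety net bypassed
```

Note that the token counts from the _TOKEN_ header do not appear anywhere
in this decision.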

> I just wanted to be sure this is just a case of not enough new points
> (tokens?) for the message to be learned, and that I wasn't doing
> something wrong.

Points: aka score, used in the context of per-rule (per-test) and
overall score, classifying a message based on the required_score
setting.

Token: think of it as a "word" used by the Bayesian classifier
sub-system. In practice, it is more complicated than simply
space-separated words. Context (e.g. headers) and case might be taken
into account, too.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}