Re: [Mimedefang] learner indicated ham

Bill Cole Sat, 09 Aug 2014 07:20:06 -0700

On 8 Aug 2014, at 12:05, Justin Edmands wrote:

> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn: message score:
> 13.934, computed score for autolearn: 17.583
> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn? ham=0, spam=7,
> body-points=7.448, head-points=5.511, learned-points=-1.9
> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn: autolearn_force
> not flagged for a rule. Body Only Points: 7.448 (3 req'd) / Head Only
> Points: 5.511 (3 req'd)
> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn? no: scored as spam
> but learner indicated ham (-1.9 < -1)

This is really a SpamAssassin issue rather than a MIMEDefang issue, so you could probably get a better answer from the broader SA community, but I'll offer a vague rambling one :)

The SA auto-learn subsystem is designed to be very cautious in what it learns because it carries diverse mistraining risks. The obvious part of the caution is the spam/non-spam thresholds for auto-learning, but there are also less prominent safeguards: the message is rescored for the threshold check using scoreset 0 or 1, the learner demands a minimum of 3 points each from body and header/network rules to score as spam unless a matched rule has the autolearn_force tflag set, and other per-rule 'tflags' can modify how the learner acts on a matching message.

As a result, a message actually has 5 scores tallied by SA: the normal score using scoreset 2 or 3, the score using scoreset 0 or 1 that gets compared to the spam and nonspam autolearn threshold settings, the body-only score, the header-only score, and the score using only rules with the "learn" tflag (by default, that's only BAYES_* rules), which is reported in debug messages as "learned-points". By default, that last value is used as a backstop to prevent wildly divergent auto-learning. If the Bayes rules score a message <-1 or >1 (by default: a Bayes probability below 1% or above 50%) in dissent from the overall score, the message will not be autolearned. In your log, learned-points is -1.9, which is below -1, so the learner refused to learn the message as spam.
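
For reference, the autolearn thresholds mentioned above are ordinary configuration settings. A sketch of how they might appear in a site config file such as local.cf, using what are believed to be the stock default values (check the Mail::SpamAssassin::Conf documentation for your installed version before relying on them):

```
# local.cf fragment: autolearn switch and thresholds (believed stock defaults)
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  0.1
bayes_auto_learn_threshold_spam     12.0
```

Against a spam threshold of 12.0, the "computed score for autolearn: 17.583" in the log clears the bar comfortably, so the refusal came from the learned-points check and not from the threshold itself.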

> Is this something that I can fix? I want stuff to be trained as spam
> but it doesn't seem to make it. I am thinking it's either a setting I
> am not aware of or I need to retrain my bayes DB ham. Any help would
> be great.

The real question is whether it is a problem at all, i.e. whether it's a thing that merits fixing rather than a thing that is working as designed and, at least in aggregate, for your benefit. Probably that particular message was spam, given the very high score spread across rule types, but it is certain that learning it as spam would change the way your Bayes DB interprets similar messages and possible (absent other evidence) that it was not spam at all. Unless you do intensive periodic score adjustments of your non-Bayes rules based on a carefully human-classified corpus of messages that are representative of the actual mailstream seen by SA, a well-fed Bayes DB is going to be a better judge than the other (static and mostly default) rules. As of SA v3.4 (which you apparently have, as autolearn_force is new) you can switch bayes_auto_learn_on_error to "1" to flip the auto-learner into a mode where it *ONLY* learns a message when its learned-points classification (i.e. the judgment of the existing Bayes DB) disagrees with classification based on surpassing an autolearn threshold.

Whether you leave bayes_auto_learn_on_error at its default "0" for the traditional behavior or switch it to "1" depends on what you believe to be true about the relative accuracy of your Bayes and non-Bayes SA rules. The traditional behavior expresses an assumption that the Bayes DB is less likely to make a large classification error than the rules used for the autolearn score, while the "learn on error" behavior assumes that your Bayes DB is probably in error when it disagrees with the other SA rules. Which way is better is site-specific, as that is influenced by a site's particular mailstream idiosyncrasies, the autolearn thresholds, local rules, local score adjustments to standard rules, the exclusion of messages from SA scoring by other anti-spam measures, and the nature of what gets fed to the Bayes DB after explicit human classification.
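
For anyone who decides to try the "learn on error" mode, the switch is a single line. A minimal sketch, assuming SA 3.4 or later and a site config file such as local.cf (the file name and location are assumptions about the local setup):

```
# local.cf fragment: learn only when Bayes and the autolearn score disagree
bayes_auto_learn_on_error 1
```

Setting it back to 0, or removing the line, restores the traditional behavior. Running a saved sample message through "spamassassin -D learn" before and after the change should show the resulting auto-learn decision in the debug output, in the same "dbg: learn:" lines quoted at the top of this message.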

Another way to increase autolearning without going all the way to the "learn on error" behavior is to flag rules that you trust highly as "autolearn_force" so that messages matching them won't ever be excluded from autolearning based on the existing Bayes DB disagreeing with the deterministic rules. I have started doing this for locally-defined meta-rules that match on multiple hits on "net" rules such as the URIBL family. My reasoning there is that an identical message can get autolearned as ham at 12:00 because the spammer filled it with Bayes-busting garbage and freshly minted payload URLs and sent from a fresh "snowshoe" range, but score well past the autolearn spam threshold at 12:05 because by then multiple network services checked by SA rules have switched their opinions. In short: there are non-Bayes rules which are more dynamic than Bayes and utilize information invisible to Bayes, and which are in aggregate an ideal basis for judging Bayes as both wrong on a particular message and in need of training to mitigate future related error.
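
A local meta-rule of the kind described above might look like the following sketch. The rule name LOCAL_MULTI_URIBL, its score, its description, and the particular URIBL rules combined are all hypothetical, picked only for illustration; they are not the rules actually in use on the author's system.

```
# local.cf fragment: hypothetical meta-rule; name, score and rule mix are illustrative
meta      LOCAL_MULTI_URIBL  (URIBL_BLACK + URIBL_DBL_SPAM + URIBL_SBL) >= 2
describe  LOCAL_MULTI_URIBL  Payload URIs listed by at least two URI blocklists
score     LOCAL_MULTI_URIBL  0.5
tflags    LOCAL_MULTI_URIBL  autolearn_force
```

The meta expression counts how many of the named rules hit and fires when at least two did. Running "spamassassin --lint" after adding a rule like this is a cheap way to catch syntax errors before it reaches production mail.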

