Short version: I'll fix the auto score-gen, I promise. I'm putting it on a VM so that it doesn't break unexpectedly again.
On 29/12/2009 8:17 AM, Justin Mason wrote:
> On Mon, Dec 28, 2009 at 02:18, Warren Togami <[email protected]> wrote:
>> After the release of 3.3.0 we need to think about how rule updates
>> distributed via sa-update will work. The goal here is to make it
>> quick and easy to safely add new or adjust existing rules, so that
>> sa-update keeps SpamAssassin effective over time. This extends the
>> useful life-span of a SpamAssassin release. We can then propose a
>> 3.3.x maintenance release only once enough worthwhile changes have
>> accumulated, or for security fixes.
>>
>> jm explained a few weeks ago that 3.2.x sa-update rule updates are
>> currently not auto-updated because we lack a separate ruleqa system.
>> Our ruleqa system tests only the svn trunk in the nightly masscheck.
>> It would be too much for our nightly masscheck volunteers to run the
>> nightly masscheck twice, so doing both is not an option.

I don't think mass-checking both the trunk and stable branches is
necessary (or useful enough to be worth it). Rules that can be
auto-added and pushed via updates are all no-code-change rules ("can"
meaning that we'll never ship code via updates, even though it's
technically possible). Code changes in trunk usually only help rules
hit more rather than less, so the same rules on the stable branch will
probably be just as safe or safer (they'll hit the same amount or
less).

>> In talking with jm a few weeks ago, we seem to be in agreement that
>> we should change this procedure for 3.3.x. Nightly masscheck will
>> continue to check using the svn trunk, but rule updates will be
>> pushed to 3.3.x users.

Yep. That's been the idea for a long while now. One problem has been
tuits, the other has been, IMO, a small ham corpus (it appears to be
getting larger now, although I don't know if it's large enough yet).

>> Rule Version Conditionals
>> =========================

[snip 'if can' stuff]

> we then ensure that rule-breaking changes need to include a method
> that can be used by rules using this method. e.g.

Yep. We should also be able to catch it when it's missing (people,
most people, everyone really, will forget to use it once in a while)
when generating a stable branch update.
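Just so we're all picturing the same thing, the guard I have in mind
looks roughly like this in a rules file (feature_foo_headers is only an
illustrative name; the real method would be whatever the trunk code
change actually adds to Mail::SpamAssassin::Conf):

  # the trunk code change adds something like
  #   sub feature_foo_headers { 1 }
  # to Mail::SpamAssassin::Conf, and any rule that depends on the new
  # behaviour gets wrapped in a conditional on it:
  if can(Mail::SpamAssassin::Conf::feature_foo_headers)
    header   FOO_EXAMPLE_RULE   X-Foo-Header =~ /example/
    describe FOO_EXAMPLE_RULE   Example rule that needs the new trunk code
  endif

The stable-branch update generation could then lint the rules against
the 3.3 code and flag anything that fails without such a guard.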
> We also need to add a build to Hudson to build 3.3.x maintenance
> using trunk's rules, and run the tests, to ensure that the maint
> branch works ok with trunk's rules.

It wouldn't hurt. It could probably be built directly into the update
packaging process too, to reduce update testing complexity (stages,
delays, etc.).

>> With rule version conditionals we might consider that svn trunk
>> targets the next 3.3.x maintenance release instead of working on a
>> branch. We have limited developer hours so we might be better off
>> focusing exclusively on trunk. This worked reasonably well during
>> the past year with pre-3.3.0 trunk. Any thoughts about this part?
>
> I'm -1 on this idea, however. We've previously always switched to a
> maintenance branch for post-release fixes, and it's easy enough.

I'm also -1 on a stable trunk. Branching stable, as we've done in the
past, is the way to go.

>> Explicit Promotion
>> ==================
>> The ruleqa system periodically has problems where it gets stuck
>> having processed only the bb-* corpora but not the others. This
>> seems to cause the combined results to swing wildly, and rules are
>> promoted and demoted for seemingly no reason.

I've seen the bug Warren is referring to once or twice in the rule-qa
output. The net-check before last only had bb-* corpora in the rule-qa
output. I can't remember if there's a cut-off time period for
submissions to the rule-qa app... perhaps there's a timing issue.

> Suggestion: rule promotion/demotion requires a certain "quorum" of
> both bb-* and non-bb-* corpora to happen. It already requires a
> quorum of N corpora (of any type). If it doesn't meet this, the
> existing promoted rules list is kept as-is.

I would think that we need both bb-* and non-bb-* corpora, along with a
minimum ham message count and a maximum per-contributor weighting
factor (so that one contributor's ham can't make up the minimum all by
itself).

I'd also be interested in stats on how often rules bounce on and off
the promoted list. That could be compiled by comparing svn revisions...
I might take a look at doing that.
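Off the top of my head, something quick like this would probably do as
a first pass (completely untested, and the rules/72_active.cf URL is
from memory, so treat it and the REV1/REV2 revision numbers as
placeholders):

  # dump the promoted rule names at two revisions, then diff the lists
  for r in REV1 REV2; do
    svn cat -r $r \
      http://svn.apache.org/repos/asf/spamassassin/trunk/rules/72_active.cf |
      grep -E '^[[:space:]]*(header|body|rawbody|uri|full|meta)[[:space:]]' |
      awk '{print $2}' | sort -u > active-$r.txt
  done
  diff active-REV1.txt active-REV2.txt

Running that over a series of revisions would give a rough count of how
often each rule flips on and off the list.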
>> The ruleqa system is incapable of auto-promoting rare hitting but
>> ultra-accurate rules like VANITY.
>
> yes, definitely a good candidate for force-active...
>
>> For reasons like this, we should force active certain rules when
>> we're certain they are safe. Adding the rule to
>> rulesrc/10_force_active.cf seems to be sufficient.
>>
>> I propose that we have a simple, low bar of requirements to govern
>> explicit promotion.
>>
>> * By judgement call the rule is obviously safe, or proven by ruleqa.
>> * Any two committers agree.
>> * No bug required, but state who agreed in the commit.
>
> +1

+1, provided that "obvious" means a rule that is specific enough not to
hit on anything that isn't obvious. Otherwise, I think there should be
at least one nightly mass-check done to verify that it doesn't have
unexpected results.

>> Scoring
>> =======
>> Currently auto-promoted rules all have a score of 1. Scores need to
>> be defined in rules/50_scores.cf to have any other score.
>>
>> I propose that we have a simple, low bar of requirements to control
>> assignment of any score greater than 1.
>>
>> * One committer per point must agree, rounded up. (1.4 points
>>   require two committers to agree. 2.3 points require three.)
>> * No bug required, but state who agreed in the commit.
>
> I think it's a good idea, but I'm worried about two things:

I don't really like this system as the standard way of doing things, at
all. I think it may jeopardize our accuracy and credibility if we start
assigning scores this way as a matter of course. If there were no other
option I would say sure, but instead, I promise to fix the daily
score-gen.

> - it'll take a lot of overhead in wrangling voters; 3 voters may be
>   too much. I'd be happy with just 2, since we can always
>   retrospectively veto in cases where we disagree.
>
> - Daryl, thoughts regarding the weekly run of the GA? is that
>   workable yet? this proposed system is incompatible with that.

I figured out what was wrong with the daily run of the GA... one cause
was the re-org of trunk (I knew about that, but coincidentally fixing
it didn't fix the run), and the other was that pgapack got broken on my
machine. That took a while to track down since I'd forgotten pgapack
was required and I was getting bizarre (but detectably broken!) results
from the automated GA run while it was broken. I am going to set up a
virtual machine solely for automated GA runs so that I don't have to
worry about things breaking unexpectedly in the future. I'm feeling
like this will happen soon.

> JH:
>> I was hoping that at least some sort of automatic analysis for
>> assigning scores could be incorporated into the process. Is the
>> consensus that the nightly masscheck corpus isn't large enough to
>> support doing this?
>
> Warren:
>> That would be ideal, but yes, the nightly masscheck is WAY too
>> small. Even our mcsnapshot was too small and required lots of manual
>> massaging to output scores that satisfied us.

Whoa, what? Is there a diff available of the "lots of manual
massaging"? I must have missed that, and it doesn't sound normal.
Manual tweaking often gets started (or gets talked about), and then
there's usually a stats smack-down and things get more or less left
alone. Sometimes we fudge really closely-scored things that people
think should be linear, just so we don't get a barrage of queries about
it on the users' list; other than that I don't recall "lots of manual
massaging". I'm scared.

> if I recall correctly, the initial plan for the weekly-GA was that it
> would only generate scores for newly-defined rules in the sandboxes.
> If the "base", non-sandbox ruleset had stable, infrequently-changed
> scores, and the sandbox rules were more in flux, that insulates us
> against the manual-massaging problem.

Yes. The "base" ruleset scores were not changed, on the theory that the
larger, supervised mass-check of better-cleaned corpora was best left
alone, given that the nightly mass-checks, although more up-to-date,
would not be as accurate.

> Anyway, that really needs a comment from Daryl ;)

...and I still think that should be the case. Semi-annual (perhaps)
organized mass-checks for re-scoring during a stable branch would be
great, but I don't think we should re-score en masse based on the
nightly mass-checks.

The way I've got things written, all existing base scores are locked (I
can't remember exactly what causes that... the non-mutable flag?), and
then all of the base and new rules are run through the GA using the
nightly and weekly results, which produces the same base scores plus
new scores for the sandbox-promoted rules. It actually works well... I
never had a complaint about the scores, and they were used on a few
production systems processing 100 or so million messages a day. The
best part is that a lot of the time the scores were not intuitive (some
looked low, some looked high), yet after running the rules with those
scores they appeared to work as intended.

Daryl
