Duncan Findlay writes:
> On Wed, Jan 03, 2007 at 02:42:44PM +0000, Justin Mason wrote:
>
> > - T + 0 days: announce a heads-up mail. clean up our corpora, get ready
> > for mass-checking, try out mass-check to spot any big memory leaks or
> > whatnot, fix remaining bugs that affect mass-checks (esp bug 5260!),
> > get people signed up, enable all rules in svn.
>
> > - T + 1 week, around a Thursday or so: start --bayes --net mass-checks;
> > move to C-T-R.
>
> > - T + 3 weeks, a Monday or so: hopefully finish mass-checks, bugs
> > allowing ;) (note that includes two weekends.)
>
> > - T + 3 weeks: perceptron runs, voting on new proposed scores, etc
>
> > - T + 4 weeks and a bit: hopefully ready to release
>
> +1
>
> BTW, how do we generate all 4 scoresets from one run? We used to have
> to do two runs, and I can't remember the rationale for that, or the
> rationale for doing it one. :-)

Well, I took a look back at the 3.1.0 score-generation to figure this
out, since I'd forgotten. Here are the old instructions:

  http://wiki.apache.org/spamassassin/RescoreDetails

Basically, we do a single set3 mass-check, with all scores unzeroed.
This uses "--bayes --learn=35", which uses Bayes and learns 35% of all
mails in whichever direction SpamAssassin classified them (in other
words, a pretty simplistic auto-learn algorithm, with errors). I think
the idea was to simulate "real" Bayes auto-learning, which includes
errors too.

From that, we can derive:

  set-0: by removing all net and BAYES rules from the log
  set-1: by removing all BAYES hits
  set-2: by removing all net hits
  set-3: what we did

The key appears to be the --learn=35 bit.

It's hard to recall the details -- we didn't note much of it down, I
think, and it was 19 months ago :(

--j.
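
For illustration, the set-0..set-3 derivation above can be sketched in
Python. This is only a sketch under assumptions, not the actual rescoring
tooling: derive_scoresets() and NET_RULES are hypothetical names, the
per-message hit list is assumed to have been parsed from the set3 log
already, BAYES rules are assumed to be recognizable by a "BAYES_" name
prefix, and the set of net rules is assumed to be supplied by the caller.

```python
# Hypothetical sample of net rule names; a real run would need the full list.
NET_RULES = {"RCVD_IN_XBL", "URIBL_SBL"}


def derive_scoresets(hits, net_rules):
    """Split one message's set3 rule hits into the four scoresets.

    hits      -- list of rule names that fired in the set3 run
    net_rules -- set of rule names that need network access (assumed given)
    """
    def is_bayes(rule):
        # Assumption: Bayes rules are named BAYES_00, BAYES_50, BAYES_99, ...
        return rule.startswith("BAYES_")

    def is_net(rule):
        return rule in net_rules

    return {
        0: [r for r in hits if not is_bayes(r) and not is_net(r)],  # no net, no Bayes
        1: [r for r in hits if not is_bayes(r)],                    # net, no Bayes
        2: [r for r in hits if not is_net(r)],                      # Bayes, no net
        3: list(hits),                                              # what we ran
    }


if __name__ == "__main__":
    example_hits = ["BAYES_99", "RCVD_IN_XBL", "HTML_MESSAGE"]
    for set_no, rules in derive_scoresets(example_hits, NET_RULES).items():
        print(f"set-{set_no}: {','.join(rules)}")
```

Filtering only works in this direction: the set3 run records every hit,
so the three smaller sets are subsets of its log, which is presumably why
one run is enough.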