Hey,
we've been talking over the "rule promotion" situation and how "sa-update"
will work, and we've agreed that having committers manually cut and paste
rules won't scale and is too much work.
As a result, here are some notes from a whiteboard session where we
planned out how to fix it so that rule promotion and sa-update both work.
SVN TREE LAYOUT:
----------------
trunk
  -> lib (code, engine)
  -> rules (code-tied ruleset, changes per version)
       - GONE: 50_scores.cf
rulesrc
  -> core
       - current core ruleset
       - *multiple* scores files
       - taking over from 50_scores.cf
       - can contain "ifversion" sections for specific
         releases
  -> sandbox
  -> active
       - the new "active set" of rules published for sa-update.
       - when "build/mkrules" runs, these are *not* copied into
         the "rules" directory.
Note that when "build/mkrules" is run, core and sandbox are copied; active is
not. The active set is purely a *subset* of the core and sandbox sets.
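
To make the copy rule above concrete, here is a minimal Python sketch. It is
not the real "build/mkrules" script; the helper name select_rule_sources and
the "*.cf" filter are assumptions made purely for illustration.

```python
from pathlib import Path

# Subdirectories of "rulesrc" that get copied into "rules", per the notes
# above. "active" is deliberately absent: it is only a subset of these two.
COPIED_SETS = ("core", "sandbox")


def select_rule_sources(rulesrc):
    """Return the .cf files under rulesrc/core and rulesrc/sandbox.

    Hypothetical helper for illustration only; nothing under
    rulesrc/active is ever returned.
    """
    root = Path(rulesrc)
    selected = []
    for set_name in COPIED_SETS:
        set_dir = root / set_name
        if set_dir.is_dir():
            # Sort within each set so the result is deterministic.
            selected.extend(sorted(set_dir.rglob("*.cf")))
    return selected
```

Leaving "active" out of the list of copied sets, rather than filtering it out
afterwards, mirrors the point that it holds nothing that isn't already in
core or sandbox.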
TASKS IN PROCESS:
-----------------
NIGHTLY TAGGING FOR M-C (CENTRALISED):
input: SVN
output: SVN
- same as current
MASS-CHECKS (DISTRIBUTED): [multiple users in parallel]
input: SVN
thru: mass-check
output: logs
- same as current
- Note: mass-checks do not run with the "active set". They run with all of
rulesrc/core, and rulesrc/sandbox. Only the end-user systems running
sa-update use the limited subset that's found in the "active set".
RULE SELECTION/PROMOTION (CENTRALISED):
input: SVN
input: logs
output: SVN, "active set"
- use previous day's logs (run at 0800 UTC)
- TODO? need an SVN userid to commit results from cron?
- auto-promotion of "good" rules from sandbox and core.
Normally all rules are auto-promoted, based on how "good" they are. This
can be inhibited by setting a tflag, "tflags nopublish".
"nopublish" allows us to work on rules like T_FORGED_OUTLOOK_TAGS,
which is a bug-fix of an existing rule and so *would* be considered
immediately promotable. We need a way to inhibit that, so that promotion
stays under manual control.
The "T_" prefix also implies "nopublish". The corollary of this is
that rules in the sandbox no longer have to have a "T_" prefix;
they now only need it if they're meant to be "nopublish". This helps
reduce the need to rename rules if they move from sandbox
to core.
- Promoted rules are *duplicated* from sandbox and core, into the
"active set". This is the set of rules that are published in
an sa-update update file.
- "bad" rules in core are deleted. That means *gone*, but can be
recovered from SVN history.
Rationale: in our experience, bad, atrophied rules are pretty much never
salvageable!
- generate a domain-specific language script to perform
promotions/deletions/etc.
- Note: SVN trunk, mass-checks, etc. do not run with the "active set". They
run with all of rulesrc/core, and rulesrc/sandbox. Only the end-user
systems running sa-update use the limited subset that's found
in the "active set".
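
The two promotion inhibitors described above ("tflags nopublish" and the "T_"
prefix) could be checked along these lines. This is a rough Python sketch,
not the promotion script itself: the function names are hypothetical, and it
assumes the usual "tflags RULE_NAME flag ..." line layout in the rules files.

```python
import re

# Matches a line such as "tflags RULE_NAME nopublish" in a rules file.
# Several flags on one line (e.g. "tflags FOO net nopublish") are allowed.
TFLAGS_RE = re.compile(r"^\s*tflags\s+(\S+)\s+(.*)$")


def nopublish_rules(cf_text):
    """Return the set of rule names carrying the 'nopublish' tflag."""
    flagged = set()
    for line in cf_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = TFLAGS_RE.match(line)
        if match and "nopublish" in match.group(2).split():
            flagged.add(match.group(1))
    return flagged


def is_promotable(rule_name, flagged):
    """A rule may be auto-promoted unless it is T_-prefixed or nopublish."""
    return not rule_name.startswith("T_") and rule_name not in flagged
```

Whether a promotable rule is actually "good" enough is a separate question
that depends on the mass-check logs; this only covers the manual opt-out.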
SCORING (CENTRALISED):
input: SVN
input: logs
thru: perceptron/scoring
output: SVN
- the logs contain all rules from "core" and "sandbox", so filter them down
to only the subset of rules that are in the active set, so that the
perceptron doesn't try to use the others
- fix Bayes scores (I think this means set them to fixed values, instead
of letting them "float" and attempting to optimise with perceptron)
- Daniel says: TODO: fix rewrite-cf-with-new-scores to deal with:
- automated-generation vs. manual scores in separate files
- ifplugin blocks inside the scores files
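
The log-filtering step from the first bullet of this section might look
roughly like the following Python sketch. The line layout is an assumption
(whitespace-separated fields with the comma-separated rule hits in the fourth
field), and active_hits is a hypothetical name, so treat this as a sketch to
adapt to the real log format rather than a drop-in tool.

```python
def active_hits(line, active):
    """Return the rule hits on one log line that belong to the active set.

    Assumed line layout (an assumption, whitespace-separated):
        <flag> <score> <message-id> <RULE_A,RULE_B,...> [extra fields]
    Comment lines starting with '#' and short lines yield an empty list.
    """
    if line.startswith("#"):
        return []
    fields = line.split()
    if len(fields) < 4:
        return []
    # Keep the original order of the hits, dropping non-active rules.
    return [name for name in fields[3].split(",") if name in active]
```

Passing the active set as a Python set keeps each membership test cheap,
which matters when the logs cover a whole corpus.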
PACKAGING (CENTRALISED):
input: SVN, the "active set" only
output: packages
- TODO: need a password-less method to sign packages
- automated test suite for packages before they're published
- The package will contain both new rules, and rules that were part of
"core" for the 3.1.0 release. To avoid the latter conflicting with rules
in the 3.1.x release, we will produce a 3.1.x point release that deletes
the ruleset from /usr/share/spamassassin, and immediately runs
"sa-update"!
- assume 3.1.x and earlier versions can safely use scores generated
against "svn trunk" for the "active" set, even though they may
not be exactly accurate for that release. (the alternative is
running a full mass-check for all releases -- too much!)
RULE STATES:
------------
These are the states that rules pass through.
Rules in sandbox:
- experimental -- don't promote me. "T_" prefix implies this.
"tflags nopublish" ditto.
- s_poor -- promotable, but not meeting promotion criteria.
- s_good -- promotable, and meeting criteria. Rules in this
state are copied into the "active set".
Rules in core:
- c_poor -- promotable, but not meeting promotion criteria.
- c_good -- promotable, and meeting criteria. Rules in this state are
copied into the "active set".
Deleted rules:
- gone -- rule has been deleted. If a rule is in c_poor for "an
extended period of time", it goes here.
So the permitted transitions are:
- experimental <---> s_poor
- experimental <---> s_good
- s_poor <---> s_good
- c_poor <---> c_good
- c_poor -> gone
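
The states and transitions above can be written down as a small lookup
table. This Python sketch simply transcribes the list; the names TRANSITIONS,
ACTIVE_STATES, can_transition and in_active_set are made up for the example.

```python
# Permitted transitions, transcribed from the list above. "<--->" entries
# appear in both directions; "c_poor -> gone" is one-way.
TRANSITIONS = {
    "experimental": {"s_poor", "s_good"},
    "s_poor": {"experimental", "s_good"},
    "s_good": {"experimental", "s_poor"},
    "c_poor": {"c_good", "gone"},
    "c_good": {"c_poor"},
    "gone": set(),
}

# States whose rules are copied into the "active set".
ACTIVE_STATES = {"s_good", "c_good"}


def can_transition(old, new):
    """True if the notes permit moving a rule from state `old` to `new`."""
    return new in TRANSITIONS.get(old, set())


def in_active_set(state):
    """True if rules in this state are published via sa-update."""
    return state in ACTIVE_STATES
```

Note that the list does not include a sandbox-to-core move (for example
s_good to c_good), so this table rejects it; if that move is meant to be
allowed, it would need adding explicitly.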