hackathon notes from Sat

Justin Mason Sat, 10 Dec 2005 17:38:59 -0800

Hey,

so we've been talking over the "rule promotion" situation and how "sa-update"
will work, and we've agreed that having committers manually cut and paste
rules won't scale and is too much work.


As a result, here are some notes from a whiteboard session where we planned
out how to fix it so that rule promotion and sa-update work.


SVN TREE LAYOUT:
----------------


    trunk
        -> lib (code, engine)

        -> rules (code-tied ruleset, changes per version)
            - GONE: 50_scores.cf

    rulesrc
        -> core
            - current core ruleset
            - *multiple* scores files
                - taking over from 50_scores.cf
                - can contain "ifversion" sections for specific
                  releases

        -> sandbox

        -> active
            - the new "active set" of rules published for sa-update.

            - when "build/mkrules" runs, these are *not* copied into
              the "rules" directory.


Note that when "build/mkrules" is run, core and sandbox are copied; active is
not.  active is purely a *subset* of the core and sandbox sets.
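
The subset property above could be checked mechanically. Here's a minimal
sketch in Python, assuming each set is simply a collection of rule names; the
function name and the example rule names are hypothetical, not part of
build/mkrules.

```python
def stray_active_rules(core, sandbox, active):
    """Return rules in the active set that exist in neither core nor sandbox.

    An empty result means "active" really is a subset of core + sandbox.
    """
    return set(active) - (set(core) | set(sandbox))


# Hypothetical rule names, for illustration only.
core = {"EXAMPLE_CORE_A", "EXAMPLE_CORE_B"}
sandbox = {"EXAMPLE_SANDBOX_A"}

ok = stray_active_rules(core, sandbox, {"EXAMPLE_CORE_A", "EXAMPLE_SANDBOX_A"})
bad = stray_active_rules(core, sandbox, {"EXAMPLE_CORE_A", "EXAMPLE_ORPHAN"})
```

A check like this would probably fit best right after the promotion step,
before anything is packaged.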



TASKS IN PROCESS:
-----------------


NIGHTLY TAGGING FOR M-C (CENTRALISED):

input: SVN
output: SVN

    - same as current

MASS-CHECKS (DISTRIBUTED): [multiple users in parallel]

input: SVN
thru: mass-check
output: logs

    - same as current

    - Note: mass-checks do not run with the "active set". They run with all of
      rulesrc/core, and rulesrc/sandbox.  Only the end-user systems running
      sa-update use the limited subset that's found in the "active set".

RULE SELECTION/PROMOTION (CENTRALISED):

input: SVN
input: logs
output: SVN, "active set"

    - use previous day's logs (run at 0800 UTC)

    - TODO?  need an SVN userid to commit results from cron?

    - auto-promotion of "good" rules from sandbox and core.  Normally all
      rules are autopromoted, based on how "good" they are.  This can be
      inhibited by setting a tflag, "tflags nopublish".

        "nopublish" allows us to work on rules like T_FORGED_OUTLOOK_TAGS,
        where it's a bug-fix of an existing rule, and it *would* be considered
        immediately promotable.  We need a way to inhibit that, so that it's
        under manual control. 

        The "T_" prefix also implies "nopublish".  The corollary of this
        is that rules in the sandbox no longer have to have a "T_" prefix;
        they now only need that if they're "nopublish".  This helps
        reduce the need to rename rules if they move from sandbox
        to core.

    - Promoted rules are *duplicated* from sandbox and core, into the
      "active set".  This is the set of rules that are published in
      an sa-update update file.

    - "bad" rules in core are deleted.  That means *gone*, although they can
      still be recovered from SVN history.

        Rationale: bad, atrophied rules are pretty much never salvageable in
        our experience!

    - generate a domain-specific language script to perform
      promotions/deletions/etc.

    - Note: SVN trunk, mass-checks, etc. do not run with the "active set". They
      run with all of rulesrc/core, and rulesrc/sandbox.  Only the end-user
      systems running sa-update use the limited subset that's found
      in the "active set".
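
The inhibition part of the promotion step ("T_" prefix or "tflags nopublish")
can be sketched as below.  This is only an illustration in Python with a
hypothetical function name; it says nothing about how "good" is measured,
since the promotion criteria aren't defined in these notes.

```python
def is_auto_promotable(rule_name, tflags=()):
    """Return True unless promotion is inhibited for this rule.

    Promotion is inhibited by a "T_" name prefix or by "tflags nopublish".
    Whether the rule is actually "good" enough is a separate question.
    """
    if rule_name.startswith("T_"):
        return False
    return "nopublish" not in tflags


# T_FORGED_OUTLOOK_TAGS is the example from the notes; the other rule
# name is hypothetical.
held_by_prefix = is_auto_promotable("T_FORGED_OUTLOOK_TAGS")
held_by_tflag = is_auto_promotable("EXAMPLE_RULE", tflags=["nopublish"])
candidate = is_auto_promotable("EXAMPLE_RULE", tflags=["net"])
```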


SCORING (CENTRALISED):

input: SVN
input: logs
thru: perceptron/scoring
output: SVN

    - the logs contain all rules from "core" and "sandbox", but grep out only
      the subset of rules that are in the active set so that the perceptron
      doesn't try to use the others

    - fix Bayes scores (I think this means set them to fixed values, instead
      of letting them "float" and attempting to optimise with perceptron)

    - Daniel says: TODO: fix rewrite-cf-with-new-scores to deal with:
        - automated-generation vs. manual scores in separate files
        - ifplugin blocks inside the scores files
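
The "grep out only the active set" step could look roughly like the sketch
below.  It assumes the log entries have already been parsed into lists of rule
names per message (the real mass-check log format isn't reproduced here), and
the function and rule names are hypothetical.

```python
def restrict_hits_to_active(hits, active_set):
    """Keep only the rule hits that belong to the active set, in order."""
    return [rule for rule in hits if rule in active_set]


# Hypothetical data: one list of rule hits per scanned message.
active_set = {"EXAMPLE_CORE_A", "EXAMPLE_SANDBOX_A"}
parsed_log = [
    ["EXAMPLE_CORE_A", "EXAMPLE_CORE_B"],
    ["EXAMPLE_SANDBOX_B"],
    ["EXAMPLE_SANDBOX_A", "EXAMPLE_CORE_A"],
]

# This filtered view is what would be handed on to the perceptron.
filtered = [restrict_hits_to_active(hits, active_set) for hits in parsed_log]
```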


PACKAGING (CENTRALISED): 

input: SVN, the "active set" only
output: packages

    - TODO:  need a password-less method to sign packages

    - automated test suite for packages before they're published

    - The package will contain both new rules, and rules that were part of
      "core" for the 3.1.0 release.  To avoid the latter conflicting with rules
      in the 3.1.x release, we will produce a 3.1.x point release that deletes
      the ruleset from /usr/share/spamassassin, and immediately runs
      "sa-update"!

    - assume 3.1.x and earlier versions can safely use scores generated
      against "svn trunk" for the "active" set, even though they may
      not be exactly accurate for that release.  (the alternative is
      running a full mass-check for all releases -- too much!)



RULE STATES:
------------

These are the states that rules pass through.


    Rules in sandbox:

        - experimental -- don't promote me.  "T_" prefix implies this.
          "tflags nopublish" ditto.

        - s_poor -- promotable, but not meeting promotion criteria.

        - s_good -- promotable, and meeting criteria.  Rules in this
          state are copied into the "active set".

    Rules in core:

        - c_poor -- promotable, but not meeting promotion criteria.

        - c_good -- promotable, and meeting criteria.  Rules in this state are
          copied into the "active set".

    Deleted rules:

        - gone -- rule has been deleted.   If a rule is in c_poor for "an
          extended period of time", it goes here.


So the permitted transitions are:

        - experimental <---> s_poor
        - experimental <---> s_good
        - s_poor <---> s_good
        - c_poor <---> c_good
        - c_poor -> gone
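
The transition list above can be written down as a small lookup table.  Here's
a minimal sketch in Python; the state names are the ones from these notes,
while the function and variable names are hypothetical.

```python
# Pairs that may move in either direction ("<--->" in the list above).
BIDIRECTIONAL = [
    ("experimental", "s_poor"),
    ("experimental", "s_good"),
    ("s_poor", "s_good"),
    ("c_poor", "c_good"),
]

# Pairs that may only move one way ("->" in the list above).
ONE_WAY = [("c_poor", "gone")]

PERMITTED = set(ONE_WAY)
for a, b in BIDIRECTIONAL:
    PERMITTED.add((a, b))
    PERMITTED.add((b, a))


def can_transition(src, dst):
    """Return True if a rule may move directly from state src to state dst."""
    return (src, dst) in PERMITTED
```

For example, can_transition("c_poor", "gone") is true, but
can_transition("gone", "c_poor") is not: the list gives no way back from
"gone" short of recovering the rule from SVN history by hand.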
