Hello Theo,

Monday, February 7, 2005, 4:30:11 PM, you wrote:

TVD> Ok, here are my thoughts about how to do faster updates. ...

Looks good to me, as a user, someone who would download these rules
from SA.org, and also as a member of SARE who might create additional
channels.

TVD> I currently only think rules + scores ought to be released this
TVD> way -- people aren't going to be comfortable with automated code
TVD> updates IMO.  Code/plugins are best left to full releases.
TVD> (plugin support could be easily added later on, btw.)

Agreed.  Also thinking that to reduce bandwidth, it might be a good
idea to separate "core" rule scores from "updates" containing new and
changed rules.

TVD> Pseudo-code is below, but here's some background details:

TVD> Updates occur from "channels".  The default channel is
TVD> "updates.spamassassin.org", but the user can specify any number
TVD> of channels on the commandline to use additionally.  These can
TVD> either be provided by us (think of "updates" being stable vs
TVD> "experimental" vs ...), or some third party (as long as they
TVD> provide the same infrastructure...)

I like it. So following my thought above,
- $version.scores.spamassassin.org would contain scores against core
  rules, rescoring them as spam patterns change, for people who do not
  add rules.
- $version.updates.spamassassin.org would contain new rules with
  scores, and updated/modified/enhanced rules with their new scores.
- $version.hispamnoham.rulesemporium.com would contain rules/scores
  that hit lots of spam and no ham,
- $version.lospamnoham.rulesemporium.com would contain rules/scores
  that hit a few spam and no ham (safe, but not for sites tight on
  resources)
- $version.highso.rulesemporium.com would contain rules/scores that
  hit spam and ham, with a high S/O,
- etc.

TVD> Updates have version numbers.  The value format of which is
TVD> irrelevant, as long as it's monotonically increasing.  For our
TVD> updates I was thinking SVN revision, but could also do YYYYMMDDVV
TVD> ala DNS SOA, etc.

Good.
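
As a rough sketch of the version check (shell again): this assumes
versions are compared on their leading digits, as in the pseudo-code
further down, so both an SVN revision and a YYYYMMDDVV stamp work.
The names version_number and is_newer are made up for illustration.

```shell
# Extract the leading digits of an update version such as "1345-3_0".
# Prints nothing if the string does not start with a digit.
version_number() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9][0-9]*\).*/\1/p'
}

# Succeed (exit 0) only when the published version is strictly greater
# than the installed one.  A missing or non-numeric version counts as 0.
is_newer() (
    published=$(version_number "$1")
    installed=$(version_number "$2")
    [ "${published:-0}" -gt "${installed:-0}" ]
)
```

With this, "is_newer 158203 154203" succeeds and "is_newer 154203
154203" does not, so an unchanged channel is skipped.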

TVD> Versions are tracked per channel and SpamAssassin version.  To check
TVD> for updates, do a DNS TXT query ala
TVD> "z.y.x.updates.spamassassin.org",
TVD> where z.y.x refers to the version of SpamAssassin being used, aka:
TVD> x.y.z for 3.0.2, etc.  For simplicity, wildcards can be used on the
TVD> DNS server to match a whole set of releases.  An example:

TVD> *.0.3.updates.spamassassin.org TXT "154203"
TVD> *.1.3.updates.spamassassin.org TXT "158203"

And I assume that *.*.3 would also be viable to accept rules for all
3.x.x versions, or more to the point, *.*.2 could be used within SARE
to flag rules that apply to all 2.xx versions that predate 3.0.0.
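
To make the name construction concrete, here is a small sketch of
turning a SpamAssassin version plus a channel into the name to query.
sa_query_name is a hypothetical helper, not anything that exists in
SpamAssassin.

```shell
# Build the DNS name for a channel by reversing the version components,
# so that 3.0.2 becomes 2.0.3.<channel>.
sa_query_name() (
    version=$1
    channel=$2
    reversed=
    set -f      # no filename globbing while splitting
    IFS=.       # split the version on dots (local to this subshell)
    for part in $version; do
        # Prepend each component so the order ends up reversed.
        reversed=$part${reversed:+.$reversed}
    done
    printf '%s.%s\n' "$reversed" "$channel"
)
```

The result could then be handed to whatever resolver is at hand, for
example: dig +short TXT "$(sa_query_name 3.0.2 updates.spamassassin.org)"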

TVD> The directory that is to be mirrored out appropriately looks like:
TVD> dir/
TVD>    MIRRORED.BY
TVD>    version.ext
TVD>    version.ext.sha1
TVD>    ...
TVD>    versionn.ext
TVD>    versionn.ext.sha1

TVD> with "version.ext.gpg .. versionnn.ext.gpg" available optionally.
TVD> I don't think GPG needs to be required, but for the paranoid
TVD> amongst us, it needs to be available as an option.

Where do these updates come from?  When would the GPG signature be
applied, and by whom/what?  Within SARE we have multiple working
files, and I can see our scripts combining all files that match a
given criteria into a single channel file. The original files are
sometimes signed to validate them, but I don't see any value to having
an automated script sign the compilation. I suppose it might be a YMMV
situation.

TVD> At the end, the script outputs a number of channel.cf files,
TVD> which by default will just be read by SpamAssassin at startup
TVD> (leaving restarting spamd up to the admin outside the script,
TVD> based on exit code...)  If a different directory is used, admin
TVD> can simply include the channel.cf file in their local.cf.

Good.

TVD> There are a few things I haven't fully fleshed out yet:

TVD> 1) How to archive the update files together?  I envisioned a
TVD> similar naming convention to our normal rules directory (ie: a
TVD> bunch of files named ##_type.cf), but the script should just
TVD> expect to download a single file which will then be expanded.  I
TVD> don't want to rely on system calls to run an expansion, nor do I
TVD> want to expect tar or zip to be installed, etc.

I would think that the compilation script could simply cat the
component files together.  eg [I often use shell as my meta language]:
   version=$(date +%Y%m%d%H%M)   # simple version calc: yyyymmddhhmm
   # Loop through compilation definition files.
   # For each definition, grab output file name from line 1.
   # Remainder of lines name files fed into compilation.
   for compilefile in "$compiledir"/*.compile ; do
      outfile=$( sed 1q "$compilefile" )
      newer=no  # assume this compilation not updated
      # For each file in the compilation, check to see if it is newer
      # than the last compilation built (or no compilation exists yet).
      for infile in $( sed -n '2,$p' "$compilefile" ) ; do
         if [[ $infile -nt $outfile ]]
         then newer=yes
         fi
      done
      # If any input file is newer than the last compilation built,
      # then build a new compilation.
      if [[ $newer = yes ]]
      then echo "$version" > "$outfile"
           cat $( sed -n '2,$p' "$compilefile" ) >>"$outfile"
      fi
   done

TVD> 3) Using "channel.cf" means that it may or may not come after
TVD> local.cf. We should probably use some form of prefix to get it to
TVD> load beforehand, but what?  People should be able to override the
TVD> channel config if they want to.  I don't know if I want
TVD> "AA_updates_spamassassin_org.cf" as a file.

I would agree that we want all channel files to come before local.cf
alphabetically, and also want them to have reasonably short names.

What about a name like CH.$channel.$abbr.cf where $channel is the
channel file name (eg: updates, scores, hispamnoham, etc), and $abbr
is an abbreviation for the source of that channel (perhaps fed through
a second field on line 1, or through the second line of the channel
file).  That would give us files like:
CH.updates.SA.cf
CH.scores.SA.cf
CH.hispamnoham.SARE.cf

This leaves open the question of how do we prioritize the occasional
override?

Let's say SARE includes an "english" channel, containing our rules
that work well in the English language, USA, UK, Australia, etc., but
do not work nearly as well for sites that receive emails in other
languages.  Let's then say that SARU (our Russian counterparts) create
a channel which simply rescores our "english" channel to reflect
mass-check results in their part of the world. How can we guarantee
that their channel file scores override our scores?

TVD> Pseudo code:

TVD> - Script has a list of GPG keys which are allowed to sign update releases.
TVD>   The default is 265FA05B, which is the SA signing key.
TVD> - load Mail::SpamAssassin
TVD> - load Digest::SHA1
TVD> - load LWP
TVD> - Accept commandline options for GPG keys to allow for signing in addition
TVD>   to default (for third-party updates).
TVD> - Accept commandline option for whether or not to use GPG for verification.
TVD> - Accept commandline options for additional channels to use beyond
TVD>   updates.spamassassin.org

It'd be good if those channels could be provided either directly in
the command line (one or two additional channels) or through an input
file (a dozen or so channels).
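
Something along these lines, as a sketch only. The file format (one
channel per line, "#" starting a comment) and the name
collect_channels are my assumptions, nothing that has been agreed.

```shell
# Print one channel per line: the default channel first, then channels
# given directly as arguments, then channels listed in an optional file.
# Usage: collect_channels <channel_file or ''> [channel ...]
collect_channels() (
    channel_file=$1
    shift
    printf '%s\n' updates.spamassassin.org
    for channel in "$@"; do
        printf '%s\n' "$channel"
    done
    if [ -n "$channel_file" ] && [ -r "$channel_file" ]; then
        # Drop comments, trim surrounding whitespace, skip blank lines.
        sed -e 's/#.*//' \
            -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e '/^$/d' "$channel_file"
    fi
)
```

The first argument is passed as an empty string when there is no
channel file.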

TVD> - Accept commandline option for parent directory for updates.  Default is
TVD>   whatever the first site_rules_path value is, ie: /etc/mail/spamassassin.
TVD>   ala: $msa->first_existing_path (@M::SA::site_rules_path);
TVD> - Accept other options such as debug, version, etc.

To help those who need to put these into a user_prefs file, it'd be
good to include options that specify a) that output goes to
$HOME/.spamassassin/user_prefs, b) that all channel files are
concatenated together, along with a core user_prefs file, and c)
whether that core precedes or follows the accumulated channel files.
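
A minimal sketch of that concatenation. build_user_prefs and its
argument order are invented for illustration. It writes to a
temporary file first, so a failed run does not clobber the existing
user_prefs.

```shell
# Build a user_prefs file from a core prefs file plus channel files.
# Usage: build_user_prefs <first|last> <core_file> <output_file> [channel_file ...]
# "first" puts the core file ahead of the channel files; anything else
# puts it after them.
build_user_prefs() (
    core_position=$1
    core_file=$2
    output_file=$3
    shift 3
    tmp_file=$output_file.tmp.$$
    {
        if [ "$core_position" = first ]; then cat "$core_file"; fi
        if [ "$#" -gt 0 ]; then cat "$@"; fi
        if [ "$core_position" != first ]; then cat "$core_file"; fi
    } > "$tmp_file" && mv "$tmp_file" "$output_file"
)
```

Assuming later lines take precedence over earlier ones, "last" lets
the user's own core settings override the channels, and "first" does
the opposite.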

TVD> - exit code = 255
TVD> - foreach ( @channels ):
TVD>   - Convert channel name to "platform friendly" version?  Is
TVD>     "foo.bar.baz.etc.example.com" ok for all platforms?  I was thinking
TVD>     s/\./_/g
TVD>   - read /dir/channel.cf and get current version from comment on first line
TVD>   - convert internal SA version to z.y.x format, and query DNS for
TVD>     TXT z.y.x.channel
TVD>   - if no answer, throw error, goto next channel
TVD>   - for version checks, use ^(\d+) for version.  if same channel will have
TVD>     same update version value for different SA versions, can do "1345-3_0".
TVD>   - if version is <= current, goto next channel
TVD>   - if no /dir/channel/MIRRORED.BY file exists:
TVD>     - query DNS for TXT mirrors.channel
TVD>     - if no answer, throw error, goto next channel
TVD>     - grab URI, write to /dir/channel/MIRRORED.BY
TVD>   - read /dir/channel/MIRRORED.BY:
TVD>     - add each parent URI to internal array.  if weight given, add URI that
TVD>       many times.  (this algorithm can be made more efficient, but it's
TVD>       simple for now.)
TVD>   - foreach ( pick_random(@mirrors) ):
TVD>     - grab parent_uri/version.foo ("foo" depends on the "what archive
TVD>       method" issue)
TVD>       - if there's an error, go back and choose another mirror
TVD>     - grab parent_uri/version.foo.sha1 (ditto foo)
TVD>     - do IMS grab for parent_uri/MIRRORED.BY, missing is ok
TVD>     - if GPG is enabled, grab parent_uri/version.foo.gpg (ditto foo)
TVD>     - an error in either GPG or SHA1 causes an error for the channel, goto
TVD>       next channel
TVD>     - no error means break out of the mirror loop
TVD>     - write files to some temp place (mkdir tmpfile)
TVD>     - if no mirrors work completely, channel fails, goto next channel
TVD>   - validate version.foo.sha1 internally
TVD>     - if failed, fail channel, goto next channel
TVD>   - if GPG is enabled, validate version.foo.gpg (depends on the "how to do
TVD>     gpg" issue)
TVD>     - if failed, fail channel, goto next channel
TVD>     - file fails if signature fails, or if signature is ok but not signed
TVD>       by list of "trusted" keys

This might be a good place to also --lint the received channel file,
and fail any channel file that fails --lint.
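
For the SHA1 step quoted above, a sketch in shell. It assumes the
.sha1 file starts with the hex digest, as sha1sum prints it, and that
sha1sum (or shasum as a fallback) is installed. verify_sha1 and
sha1_of are hypothetical names.

```shell
# Print the SHA1 hex digest of a file.
sha1_of() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum < "$1"
    else
        shasum -a 1 < "$1"
    fi | awk '{print $1}'
}

# Succeed only if the archive's digest matches the one recorded in the
# accompanying .sha1 file.
verify_sha1() (
    archive=$1
    digest_file=$2
    [ -r "$archive" ] && [ -r "$digest_file" ] || exit 1
    # First field of the first line, lower-cased for comparison.
    expected=$(sed 1q "$digest_file" | awk '{print $1}' | tr 'A-F' 'a-f')
    actual=$(sha1_of "$archive")
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
)
```

A "spamassassin --lint" pass over the unpacked files would slot in
right after this check, failing the channel the same way.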

TVD>   - remove all files except MIRRORED.BY from /dir/channel
TVD>   - remove /dir/channel.cf
TVD>   - unarchive version.foo into /dir/channel
TVD>     - on error, fail channel, goto next channel
TVD>   - move new MIRRORED.BY to /dir/channel if it exists
TVD>   - remove temp version.foo* files
TVD>   - create new /dir/channel.cf file
TVD>     - first line is comment w/ version of channel
TVD>     - foreach (readdir(/dir/channel)):
TVD>       - add "include /dir/channel/file.cf", only do .cf files
TVD>   - exit code = 0
TVD> - return exit code

This has good potential.
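
For what it's worth, the weighted mirror pick in the pseudo-code is
only a few lines of shell and awk. The MIRRORED.BY line format used
here ("<uri> [weight=N]", with "#" comments) is my assumption, since
the format has not been specified in this thread. expand_mirrors and
pick_mirror are made-up names.

```shell
# Expand a mirror list so each URI appears "weight" times (default 1).
expand_mirrors() {
    awk '
        { sub(/#.*/, "") }      # strip comments
        NF == 0 { next }        # skip blank lines
        {
            weight = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^weight=[0-9]+$/) weight = substr($i, 8) + 0
            for (n = 0; n < weight; n++) print $1
        }
    ' "$1"
}

# Pick one URI at random from the expanded list, so a mirror with
# weight=3 is three times as likely to be chosen as an unweighted one.
pick_mirror() {
    expand_mirrors "$1" | awk '
        BEGIN { srand() }
        { lines[NR] = $0 }
        END { if (NR > 0) print lines[int(rand() * NR) + 1] }
    '
}
```

Since srand() is typically seeded from the clock, retries within the
same second may pick the same mirror. A real script would want to
shuffle the list once and walk it instead.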

Bob Menschel


