-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Chris Thielen writes:
> Justin Mason wrote:
> > "Daryl C. W. O'Shea" writes:
> > >>Please let me know what you think!
> >
> > >Sounds good, but I think the limited (and relatively static?) corpus may
> > >be an issue for rule development aimed at catching new spam signs.
> >
> > Good point.
> >
> > A static-ish ham corpus isn't a big problem, but we may need to supplement
> > the spam corpus with fresh feeds of new spam.  It should be possible to do
> > this either from trap feeds, or via submissions from the nightly corpus
> > submitters (rsync up bits of your corpus as you see fit).  Traps is
> > probably easier.
> 
> A couple of thoughts regarding corpus stuff with the current SARE 
> masscheck method in mind:
> 
> - Ham is private to the individual masschecker.  If there were a global 
> corpus, this would necessarily not be the case.  I would think twice 
> about sending my corpus to some (even access controlled) global corpus.

Don't worry -- I totally agree ;)

There are two mass-checking systems:

- - this one, the "preflight" mass-checker which runs quickly on a single,
  centralised small corpus, and runs continually;

- - the nightly mass-checks, which run only once per night and take a long
  time to run, but run on the distributed set of "live" corpora and include
  the entire ruleset ("core" as well as sandboxes) and all plugins.

So in other words, it's anticipated that everyone maintains their own
private corpora for the nightly mass-checks, and it's less important for
the preflight corpus to include mail from many people.

> - Individual corpus results vary dramatically.  Sometimes it's useful to 
> see how rules hit different corpora.  In your proposed model, the 
> masscheck could iterate over each corpus and masscheck on each 
> individually, then consolidate the results (one weakness of our current 
> method is that there is no consolidated view).

This is one thing the nightly runs do.
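
As a rough illustration of what a "consolidated view" over several corpora
could look like, here is a minimal sketch.  The data layout, the function
name and the rule name are all made up for illustration; this is not the
actual mass-check log format or the code the nightly runs use.

```python
from collections import defaultdict

def consolidate(per_corpus):
    """Merge per-corpus rule hit counts into one combined view.

    per_corpus maps a corpus name to (message_count, {rule: hits}).
    Returns {rule: (total_hits, overall_hit_rate, {corpus: hit_rate})}.
    """
    total_msgs = sum(count for count, _ in per_corpus.values())
    total_hits = defaultdict(int)
    rates = defaultdict(dict)
    for corpus, (count, hits) in per_corpus.items():
        for rule, n in hits.items():
            total_hits[rule] += n
            # Per-corpus rate, so a stale corpus with no hits stays visible.
            rates[rule][corpus] = n / count if count else 0.0
    return {
        rule: (
            total_hits[rule],
            total_hits[rule] / total_msgs if total_msgs else 0.0,
            rates[rule],
        )
        for rule in total_hits
    }

# Hypothetical data: one stale spam corpus and one freshly updated one.
results = consolidate({
    "stale": (1000, {"HYPOTHETICAL_RULE": 0}),
    "fresh": (1000, {"HYPOTHETICAL_RULE": 250}),
})
```

Keeping the per-corpus rates next to the overall figure means a rule that
only hits on up-to-date corpora shows up as such, rather than being
averaged away.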

> - Staleness of corpora.   Sometimes a rule is developed for a brand new 
> spam.  Chris S sometimes cranks out a new version of a rule multiple 
> times in a week as the spam mutates.  Often the users' corpora that 
> aren't up to date (usually mine ;) ) will show no hits, but if the user 
> refreshes the corpus the hits show up.  This would be an issue for 
> either type of system; for me it currently means checking my Maildirs 
> for misclassified ham, running an IMAP purge, and running an 
> exportcorpus script.  In your proposed system it would simply mean 
> adding an rsync as another step.
> 
> - Masscheck speed: a minor point, but valid I think.  The proposed 
> buildbot solution as a centralized solution doesn't scale as well when 
> additional corpora are added.   In the current SARE system each corpus 
> is checked in parallel with the rest.
> 
> - Barrier to entry: the SARE system requires each user to set up a 
> script to do the masscheck, integrate with the local MTA and ensure 
> serialization of requests, etc.  Your proposed solution (uploading of 
> corpora) is easier to get set up.

FWIW, the Buildbot system can indeed support multiple people running
buildbot slaves on their corpora.  However, if anyone were to use this,
they'd have to accept that it could be risky security-wise, since it could
possibly cause code to be run in response to a mail.

Still, if you (or others) would be happy with that, it's entirely doable
(and pretty easy to set up). ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDRMOgMJF5cimLx9ARAp9IAJ9Kvy26zGoc8gFjadjgF0cVMdgrvwCfccN/
xFXocuw0WCG34oxeXpfYDhw=
=rmO3
-----END PGP SIGNATURE-----
