By the way, I should mention where I'm thinking of going with this.
As you can see, if you compare
http://ruleqa.spamassassin.org/20071113-r594464-b (a preflight mass-check
of 4000 messages) vs http://ruleqa.spamassassin.org/20071113-r594456-n (a
nightly mass-check of 50000 messages), there are some major differences in
how accurate the rules are judged to be.
We can now complete a mass-check of 50k messages in 22 minutes, using
distributed mass-check running on the zone with 2 slaves (talon and
infiltrator) and the corpora from 3 contributors uploaded to the zone.
If we add more servers, I'm hoping we can get to a stage where we can scan
the entire uploaded corpora *on every checkin*, thereby:
- providing more accurate rule-QA data,
- delivering results faster, within 30 minutes (which is faster than the
current preflight mass-check),
- and making the minuscule 4k-message "preflight" corpus obsolete.
That would be cool ;)
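For a rough sense of scale, here's a back-of-the-envelope sketch using only
the figures above (50k messages in 22 minutes, a 30-minute target). The
function names and the `scale` factor are made up for illustration, and the
linear speed-up from extra servers is purely an assumption, not something
we've measured:

```python
# Figures quoted in this message: a 50k-message mass-check took 22 minutes.
MESSAGES = 50_000
MINUTES = 22
BUDGET_MINUTES = 30  # target turnaround per checkin


def throughput(messages: int, minutes: float) -> float:
    """Messages scanned per minute at the observed rate."""
    return messages / minutes


def capacity(rate: float, budget_minutes: float, scale: float = 1.0) -> int:
    """Messages that fit in the time budget.

    `scale` is a hypothetical speed-up factor; treating extra servers as a
    linear speed-up is an assumption and real scaling may well be worse.
    """
    return int(rate * scale * budget_minutes)


rate = throughput(MESSAGES, MINUTES)
print(f"observed rate: {rate:.0f} messages/minute")
print(f"fits in {BUDGET_MINUTES} min today: {capacity(rate, BUDGET_MINUTES)}")
print(f"fits if throughput doubled: {capacity(rate, BUDGET_MINUTES, scale=2.0)}")
```

So at the current rate there's already some headroom inside 30 minutes; how
much more we'd get from additional servers depends on how well it scales.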
There may even be a possibility of using some donated supercomputing
infrastructure to do this in the future. Who knows how fast it'd be
then... ;)
--j.