We have a bit of an issue with recent nightly mass-checks. If you look at
http://ruleqa.spamassassin.org/ , the "recent mass-checks" table at the
top contains the following:
    20060404 391250  2006-04-04 08:50:34  391250  jm
        (promotions validated)  bzoetekouw cthielen daf theo wtogami
    20060405 391250  2006-04-04 08:50:34  391250  jm
        (promotions validated)  zmi
Now, that's the same revision, r391250, in both rows. The mass-checks for
bzoetekouw, cthielen, daf, theo and wtogami were all started on 2006-04-04,
but the mass-check for zmi was started on 2006-04-05, and was therefore
treated as a separate mass-check entirely.
Unfortunately the rule-update generation script appears to have used the
later mass-check as the one to trust for the freqs for rule updates, i.e.
zmi's results alone were determining what went into rule updates.
So a few things:
1. Michael, could you shift the mass-check cron start time so that it
starts sometime during the day it was tagged? Currently, it looks like
your mass-checks are starting at 0110 UTC. Given that the tag is applied
daily at 0830 UTC, that means nearly 17 hours (16h40m) have elapsed
between tagging and the *start* of the check. Ideally the start should be
a little closer to when the tag was applied. (And it'd work around this
bug for now.)
2. Going by
http://ruleqa.spamassassin.org/20060405-r391250-n/BOUNCE_MESSAGE , it
appears you have some bounce messages in your spam corpus, too. ;)
3. We also need to decide if this is a bug or not. ;)
Bear in mind that the ruleqa system is dealing with mass-checks. Inputs to
a mass-check in this case are:
- the code, at a specific revision
- the corpus, from a specific person at a specific date
- the subset of rules being masschecked
- the date and time of execution, for network tests
The "daterev" string, e.g. "20060405-r391250-n" in the URL above, is used
as an ID to represent those inputs, in the format
"corpusdate-coderev-ruleset".
Unfortunately we can't just use code-rev and ruleset alone, since corpora
change over time (and don't have a handy ID number like code-revs do).
So I'm just using the date when the mass-check started as the ID number
identifying the corpus version. (That also takes care of 'date and time
of execution', too.)
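To make the ID format concrete, here's a rough Python sketch of splitting
a daterev string into its three parts. This is not the ruleqa code; the
names DateRev and parse_daterev are made up for illustration, and the
pattern assumes the layout seen in the URL above (8-digit date, "r" plus
revision number, then a short ruleset tag, which is treated as opaque).

```python
import re
from typing import NamedTuple


class DateRev(NamedTuple):
    """Hypothetical container for the parts of a daterev ID."""
    corpus_date: str  # date the mass-check started, as YYYYMMDD
    code_rev: int     # code revision number, without the leading "r"
    ruleset: str      # ruleset tag, kept as an opaque string


# Assumed layout: "corpusdate-coderev-ruleset", e.g. "20060405-r391250-n".
_DATEREV_RE = re.compile(r"^(\d{8})-r(\d+)-(\w+)$")


def parse_daterev(daterev: str) -> DateRev:
    """Split a daterev string into its parts, or raise ValueError."""
    m = _DATEREV_RE.match(daterev)
    if m is None:
        raise ValueError(f"not a daterev string: {daterev!r}")
    return DateRev(m.group(1), int(m.group(2)), m.group(3))


parsed = parse_daterev("20060405-r391250-n")
```

Note that the date part is the only thing standing in for the corpus
version here, which is why two runs of the same revision on different
days end up with different IDs.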
However, we also need to be able to collate a set of mass-checks together
as one correlated set; right now this is done by using that date.
This fails if the mass-checks to be collated as one set don't all take
place on the same day (as in this case). Is this a big issue?
Personally, I don't particularly think so, since given that the tag is
applied at 0830 UTC, there's plenty of "day" left before 2359 UTC for the
mass-checks to overlap in. ;)
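As a rough illustration of the collation problem (again a sketch, not the
actual ruleqa code; the collate helper and the tuple layout are invented
for this example), here is what grouping the six submitters from the
table above by start date and revision looks like, compared with grouping
by revision alone:

```python
from collections import defaultdict

# (submitter, mass-check start date, code revision), from the table above.
results = [
    ("bzoetekouw", "20060404", 391250),
    ("cthielen", "20060404", 391250),
    ("daf", "20060404", 391250),
    ("theo", "20060404", 391250),
    ("wtogami", "20060404", 391250),
    ("zmi", "20060405", 391250),
]


def collate(rows, key):
    """Group submitters by whatever key(start_date, rev) returns."""
    groups = defaultdict(list)
    for submitter, start_date, rev in rows:
        groups[key(start_date, rev)].append(submitter)
    return dict(groups)


# Keyed on start date plus revision: zmi lands in a set of one.
by_date_and_rev = collate(results, lambda d, r: (d, r))

# Keyed on revision only: all six end up in one set.
by_rev = collate(results, lambda d, r: r)
```

The revision-only grouping is there purely for contrast; as noted
earlier, revision alone isn't a sufficient ID since corpora change over
time.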
(Having said that, I do need to fix the code to *not* present zmi's
results alone as "last night's mass-check results", but that's a separate
bug.)
--j.