[to the dev@ list]
Justin Mason wrote:
Daryl C. W. O'Shea writes:
We're lacking
data. We really need to do nightly net enabled checks for the updates
to be really useful.
urgh. that'd be tricky. I don't know if you've noticed, but the
--net mass-check corpus is a *lot* smaller than the set0 one,
purely because it takes so much longer :(
That's dependent on whether or not people have already scanned their
corpus messages. If they're all already scanned it runs at the same speed.
How about extending mass-check to either mark up corpus messages that it
scans (while net-enabled) that have never been scanned before, or cache
(to disk) the net rule hits that it gets when it does the (net-enabled)
scan? Either way, we'd never have to do the net checks on
the message again.
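Roughly what I mean by the caching option, as a sketch only. Everything here
is hypothetical (NetHitCache, net_hits, the run_net_rules callback, the
one-JSON-file-per-message layout); none of it exists in mass-check today,
and it assumes a digest of the raw message is a good enough cache key.

```python
import hashlib
import json
import os


class NetHitCache:
    """Hypothetical disk cache of net rule hits, keyed by message digest."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, raw_message):
        # Assumption: the raw message bytes identify the message uniquely.
        digest = hashlib.sha1(raw_message).hexdigest()
        return os.path.join(self.cache_dir, digest + ".json")

    def get(self, raw_message):
        """Return the cached list of rule names, or None if never scanned."""
        try:
            with open(self._path(raw_message), "r", encoding="utf-8") as fh:
                return json.load(fh)
        except FileNotFoundError:
            return None

    def put(self, raw_message, hits):
        """Store the net rule hits for this message on disk."""
        with open(self._path(raw_message), "w", encoding="utf-8") as fh:
            json.dump(sorted(hits), fh)


def net_hits(raw_message, cache, run_net_rules):
    """Return net rule hits, doing the network lookups only on a cache miss."""
    hits = cache.get(raw_message)
    if hits is None:
        # First time we see this message: do the slow net-enabled scan once.
        hits = sorted(run_net_rules(raw_message))
        cache.put(raw_message, hits)
    return hits
```

Once every message in a corpus has been through this once, a net-enabled
run should cost about the same as a set0 run.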
If for some reason that's not favoured, I'd settle for a --reuse-only
run that includes all of your messages for set0 results and only
reusable messages for set1 results... all done in a single mass-check.
If you're running with set0 only your detection
rate already sucks, and if you're running with set1 you'll only get the
new rules once a week.
Can we not just assume that it's safe to copy the set0 scores for
the rest of the week?
I don't believe that it is safe. Often the set1 scores are a *lot*
lower than the set0 scores. The set0 scores are weighted a lot more heavily
(by the GA) to move the spam TP rate from 46% to 80% (seriously, check
out the scores/stats-set0 file) while set1 only moves from 88% to 96%.
If we had to just use the set0 scores I don't think I'd be comfortable
with an adjustment factor of more than 25% (that is, the set1 scores
would only be a quarter of the set0 scores).
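To make the adjustment factor concrete, a trivial sketch. The function name,
rule names and scores are made up for illustration; only the 25% ceiling
comes from the paragraph above.

```python
def scale_set0_scores(set0_scores, factor=0.25):
    """Derive stand-in set1 scores by scaling each set0 score by `factor`."""
    return {rule: round(score * factor, 3)
            for rule, score in set0_scores.items()}


# Made-up rule names and scores, purely for illustration.
set0 = {"EXAMPLE_RULE_A": 2.0, "EXAMPLE_RULE_B": 0.8}
print(scale_set0_scores(set0))
# → {'EXAMPLE_RULE_A': 0.5, 'EXAMPLE_RULE_B': 0.2}
```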
Additionally, I think we should re-use bayes results so we can more
accurately generate scores for set2 and 3. Otherwise I think I'm going
to just copy them over from sets 0 and 1 and lower them with some random
adjustment factor.
Either of those options makes sense to me.
I think we need to come up with some kind of extrapolation algorithm for
these, to be honest; I don't think 4 mass-checks are at all possible. :(
The only reason we would need 4 mass-checks is if there are meta rules
that fire in the non-net or non-bayes scoresets that won't fire if a net
or bayes rule does fire. I'm not aware of any such rules, but it's
possible for it to happen (although I'd rather just let the GA decide
whether or not the rule should be used by the net or bayes scoreset
rather than the meta rule). Otherwise, we can extract everything we
need from a single mass-check.
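
For the case where no such meta rules exist, extracting per-scoreset hits
from one net+bayes run could look roughly like this. A sketch only: it
assumes the usual scoreset numbering (0 = neither, 1 = net, 2 = bayes,
3 = both), and hits_for_scoreset and the rule names are invented for the
example, not taken from mass-check. It deliberately does nothing about
meta rules that depend on net or bayes rules.

```python
def hits_for_scoreset(hits, net_rules, bayes_rules, scoreset):
    """Filter one message's rule hits down to what `scoreset` could see.

    Bit 0 of `scoreset` means net rules are enabled, bit 1 means Bayes
    rules are enabled (so 0 = neither, 1 = net, 2 = bayes, 3 = both).
    """
    use_net = bool(scoreset & 1)
    use_bayes = bool(scoreset & 2)
    return [
        h for h in hits
        if (use_net or h not in net_rules)
        and (use_bayes or h not in bayes_rules)
    ]


# Hypothetical hits from a single net+bayes mass-check of one message.
hits = ["LOCAL_RULE_A", "NET_RULE_A", "BAYES_RULE_A"]
net_rules = {"NET_RULE_A"}
bayes_rules = {"BAYES_RULE_A"}

for scoreset in range(4):
    print(scoreset, hits_for_scoreset(hits, net_rules, bayes_rules, scoreset))
```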
Daryl