[to the dev@ list]

Justin Mason wrote:
Daryl C. W. O'Shea writes:
We're lacking data. We need to do nightly net-enabled checks for the updates to be really useful.

urgh.  that'd be tricky.  I don't know if you've noticed, but the
--net mass-check corpus is a *lot* smaller than the set0 one,
purely because it takes so much longer :(

That depends on whether people have already scanned their corpus messages. If they've all been scanned already, it runs at the same speed.

How about extending mass-check to either mark up the corpus messages it scans (while net-enabled) that have never been scanned before, or cache (to disk) the net rule hits it gets during the net-enabled scan? Either way, we'd never have to do the net checks on that message again.
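
To make the caching option concrete, here is a rough sketch in Python (mass-check itself is Perl, so this is only an illustration, not a patch). Everything in it is an assumption on my part: the NetHitCache class, the JSON file layout, keying messages by a SHA-256 of their raw bytes, and the net_scan callable standing in for the real net-enabled scan. The rule names in the usage note are just examples.

```python
import hashlib
import json
from pathlib import Path


def message_key(raw_message: bytes) -> str:
    """Identify a corpus message by a digest of its raw bytes (assumed scheme)."""
    return hashlib.sha256(raw_message).hexdigest()


class NetHitCache:
    """Hypothetical on-disk cache: message digest -> list of net rule hits."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier results if the cache file already exists.
        self.hits = json.loads(self.path.read_text()) if self.path.exists() else {}

    def lookup_or_scan(self, raw_message: bytes, net_scan):
        """Return cached net hits, running net_scan only for unseen messages."""
        key = message_key(raw_message)
        if key not in self.hits:
            # net_scan is a stand-in for the slow net-enabled scan.
            self.hits[key] = sorted(net_scan(raw_message))
        return self.hits[key]

    def save(self):
        """Write the cache back to disk so later runs can reuse it."""
        self.path.write_text(json.dumps(self.hits))
```

A run would call lookup_or_scan() for every message and save() at the end; a message that was scanned once (say it hit URIBL_BLACK) never triggers the network lookups again.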

If for some reason that's not favoured, I'd settle for a --reuse-only run that includes all of your messages for the set0 results and only the reusable messages for the set1 results, all done in a single mass-check.
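
The message selection for such a run could look roughly like this. It is a sketch under assumptions: --reuse-only is only the option proposed above, and the record layout (a dict with "id" and "net_hits", where None means the message was never net-scanned) is made up for illustration.

```python
def split_for_scoresets(messages):
    """Pick which messages feed the set0 and set1 results in one pass.

    messages: iterable of dicts with an "id" key and a "net_hits" key that
    is None when the message has no reusable net results (assumed layout).
    """
    messages = list(messages)
    # set0 (no net rules) can use every message in the corpus.
    set0_ids = [m["id"] for m in messages]
    # set1 (net rules) can only use messages with reusable net results;
    # an empty list still counts, it means "scanned, no net hits".
    set1_ids = [m["id"] for m in messages if m.get("net_hits") is not None]
    return set0_ids, set1_ids
```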

If you're running with set0 only, your detection rate already sucks, and if you're running with set1, you'll only get the new rules once a week.

Can we not just assume that it's safe to copy the set0 scores for
the rest of the week?

I don't believe that it is safe. Often the set1 scores are a *lot* lower than the set0 scores. The GA weights the set0 scores a lot more heavily to move the spam TP rate from 46% to 80% (seriously, check out the scores/stats-set0 file), while set1 only moves it from 88% to 96%.

If we had to just use the set0 scores, I don't think I'd be comfortable with an adjustment factor of more than 25% (that is, the set1 scores would be at most a quarter of the set0 scores).
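
As a worked example of that adjustment, a minimal sketch: the scale_scores helper and the rule names are hypothetical, and 0.25 is just the upper bound mentioned above, not a tuned value.

```python
def scale_scores(set0_scores, factor=0.25):
    """Derive stand-in set1 scores by scaling set0 scores by a fixed factor."""
    return {rule: round(score * factor, 3) for rule, score in set0_scores.items()}


# Hypothetical rule names and scores, for illustration only.
set0 = {"RULE_A": 2.0, "RULE_B": -1.0}
set1_estimate = scale_scores(set0)
```

With a factor of 0.25, a rule scoring 2.0 in set0 would get 0.5 in the estimated set1.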

Additionally, I think we should re-use bayes results so we can more accurately generate scores for sets 2 and 3. Otherwise I think I'm going to just copy them over from sets 0 and 1 and lower them with some random adjustment factor.

Either of those options makes sense to me.

I think we need to come up with some kind of extrapolation algorithm for
these, to be honest; I don't think 4 mass-checks are at all possible. :(

The only reason we would need 4 mass-checks is if there are meta rules that fire in the non-net or non-bayes scoresets that won't fire if a net or bayes rule does fire. I'm not aware of any such rules, but it's possible for it to happen (although I'd rather just let the GA decide whether or not the rule should be used by the net or bayes scoreset rather than the meta rule). Otherwise, we can extract everything we need from a single mass-check.
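
A rough sketch of what "extract everything from a single mass-check" could mean, assuming the log gives the full set of rule hits per message and that we know which rules are net or bayes rules. The NET_RULES and BAYES_RULES sets and the scoreset_views helper are illustrative assumptions, and the sketch deliberately ignores the meta-rule case described above.

```python
# Illustrative classification of rules; a real run would derive these
# from the rule definitions rather than hard-coding them.
NET_RULES = {"URIBL_BLACK", "RCVD_IN_XBL"}
BAYES_RULES = {"BAYES_99"}


def scoreset_views(hits):
    """Derive per-scoreset hit sets from one net+bayes-enabled scan.

    Scoresets: 0 = no net, no bayes; 1 = net only; 2 = bayes only; 3 = both.
    Does NOT re-evaluate meta rules that depend on net or bayes rules.
    """
    hits = set(hits)
    return {
        0: hits - NET_RULES - BAYES_RULES,
        1: hits - BAYES_RULES,
        2: hits - NET_RULES,
        3: hits,
    }
```

For a message that hit MISSING_SUBJECT, URIBL_BLACK and BAYES_99, the set0 view keeps only MISSING_SUBJECT, while the set3 view keeps all three.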


Daryl
