Nightly score generation for all scoresets

Daryl C. W. O'Shea Thu, 18 Oct 2007 13:41:32 -0700

[to the dev@ list]

Justin Mason wrote:

Daryl C. W. O'Shea writes:
We're lacking data. We really need to do nightly net-enabled checks for the updates to be really useful.
urgh.  that'd be tricky.  I don't know if you've noticed, but the
--net mass-check corpus is a *lot* smaller than the set0 one,
purely because it takes so much longer :(

That's dependent on whether or not people have already scanned their corpus messages. If they're all already scanned it runs at the same speed.

How about extending mass-check to either mark up corpus messages that it scans (while net-enabled) that have never been scanned before, or cache (to disk) the net rule hits that it gets when it does the (net-enabled) scan? In either case, this eliminates ever having to do the net checks on the message again.

If for some reason that's not favoured, I'd settle for a --reuse-only run that includes all of your messages for set0 results and only reusable messages for set1 results... all done in a single mass-check.
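A rough sketch of the caching option described above, to show the shape of the idea rather than how mass-check would actually do it. Everything here is an assumption for illustration: the `NetHitCache` class, the JSON file layout, the SHA-1 message key, and the `net_scan` callback that stands in for the real net-enabled scan.

```python
import hashlib
import json
import os


class NetHitCache:
    """Hypothetical on-disk cache of net rule hits, keyed by message digest."""

    def __init__(self, path):
        self.path = path
        self.hits = {}
        # Load results from earlier runs, if any.
        if os.path.exists(path):
            with open(path) as fh:
                self.hits = json.load(fh)

    def lookup(self, raw_message, net_scan):
        """Return cached net rule hits; call net_scan only for unseen messages."""
        key = hashlib.sha1(raw_message).hexdigest()
        if key not in self.hits:
            # First time we see this message: do the slow net checks once.
            self.hits[key] = sorted(net_scan(raw_message))
            self._save()
        return self.hits[key]

    def _save(self):
        # Write to a temporary file first so an interrupted run cannot
        # leave a truncated cache behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self.hits, fh)
        os.replace(tmp, self.path)
```

With something like this in place, a second run over the same corpus would never touch the network for messages it has already seen, which is the point of the proposal.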

If you're running with set0 only your detection rate already sucks, and if you're running with set1 you'll only get the new rules once a week.
Can we not just assume that it's safe to copy the set0 scores for
the rest of the week?

I don't believe that it is safe. Often the set1 scores are a *lot* lower than the set0 scores. The set0 scores are weighted a lot heavier (by the GA) to move the spam TP rate from 46% to 80% (seriously, check out the scores/stats-set0 file) while set1 only moves from 88% to 96%.

If we had to just use the set0 scores I don't think I'd be comfortable with an adjustment factor of more than 25% (that is, the set1 scores would only be a quarter of the set0 scores).
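To make the 25% figure concrete, here is a minimal sketch of deriving stand-in set1 scores from set0 scores with a fixed adjustment factor. The rule names and score values are invented for illustration and are not taken from any real scores file; `scale_scores` is a hypothetical helper.

```python
def scale_scores(set0_scores, factor=0.25):
    """Derive stand-in set1 scores by scaling each set0 score by `factor`."""
    # Three decimal places is an assumption about the desired precision.
    return {rule: round(score * factor, 3) for rule, score in set0_scores.items()}


# Invented example rules and scores.
set0 = {"EXAMPLE_RULE_A": 2.0, "EXAMPLE_RULE_B": 1.2}
set1 = scale_scores(set0)
```

A rule worth 2.0 in set0 would end up at 0.5 in set1 under this factor.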

Additionally, I think we should re-use bayes results so we can more accurately generate scores for set2 and 3. Otherwise I think I'm going to just copy them over from sets0 and 1 and lower them with some random adjustment factor.
Either of those options make sense for me.

I think we need to come up with some kind of extrapolation algorithm for
these, to be honest; I don't think 4 mass-checks are at all possible. :(

The only reason we would need 4 mass-checks is if there are meta rules that fire in the non-net or non-bayes scoresets that won't fire if a net or bayes rule does fire. I'm not aware of any such rules, but it's possible for it to happen (although I'd rather just let the GA decide whether or not the rule should be used by the net or bayes scoreset rather than the meta rule). Otherwise, we can extract everything we need from a single mass-check.
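As a sketch of what extracting everything from a single mass-check could mean: take the rule hits from one full (net plus bayes) run and filter them down per scoreset. The function name and the example rule names are hypothetical, the caller is assumed to know which rules are net or bayes rules, and the meta rule case mentioned above is deliberately not handled.

```python
def hits_for_scoreset(hits, scoreset, net_rules, bayes_rules):
    """Filter one message's rule hits down to what a given scoreset would see.

    Scoresets: 0 = no net, no bayes; 1 = net only; 2 = bayes only; 3 = both.
    Meta rules that depend on net or bayes rules are NOT re-evaluated here.
    """
    use_net = scoreset in (1, 3)
    use_bayes = scoreset in (2, 3)
    return [
        rule for rule in hits
        if (use_net or rule not in net_rules)
        and (use_bayes or rule not in bayes_rules)
    ]


# Invented rule names, for illustration only.
full_hits = ["LOCAL_RULE", "NET_RULE", "BAYES_RULE"]
net = {"NET_RULE"}
bayes = {"BAYES_RULE"}
per_set = {s: hits_for_scoreset(full_hits, s, net, bayes) for s in range(4)}
```

Each of the four filtered views could then be fed to score generation separately, without rerunning the scan.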



Daryl
