Re: Nightly score generation for all scoresets

Justin Mason Fri, 19 Oct 2007 05:12:37 -0700

Daryl C. W. O'Shea writes:
> [to the dev@ list]
> 
> Justin Mason wrote:
> > Daryl C. W. O'Shea writes:
> >> We're lacking 
> >> data.  We really need to do nightly net enabled checks for the updates 
> >> to be really useful.
> > 
> > urgh.  that'd be tricky.  I don't know if you've noticed, but the
> > --net mass-check corpus is a *lot* smaller than the set0 one,
> > purely because it takes so much longer :(
> 
> That's dependent on whether or not people have already scanned their 
> corpus messages.  If they're all already scanned it runs at the same speed.
> 
> How about extending mass-check to either mark up corpus messages that it 
> scans (while net-enabled) that have never been scanned before, or cache 
> (to disk) the net rule hits that it gets when it does the (net-enabled) 
> scan?  Either way, we would never have to do the net checks on 
> the message again.
> 
> If for some reason that's not favoured, I'd settle for a --reuse-only 
> run that includes all of your messages for set0 results and only 
> reusable messages for set1 results... all done in a single mass-check.


+1
OK, I like that.  We should not be attempting to use non-reused results
for rescoring, at all, given the temporal sensitivity of net-rule lookups.

We should keep the "full" --net run at the weekends, which can do net
lookups against non-reused messages, to measure new dev rules.

mass-check logs the status of reuse in the output lines, btw, logging
either "reuse=yes" or "reuse=no", so we should be able to estimate
usability of this now...
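
As a rough illustration of that estimate, the sketch below tallies the reuse
flags in a mass-check log. It relies only on what is stated above, namely that
each result line carries either "reuse=yes" or "reuse=no". The function names,
the treatment of '#' lines as comments, and the sample lines are hypothetical;
the real log layout may differ.

```python
from collections import Counter

def reuse_counts(lines):
    """Tally reuse=yes / reuse=no markers across mass-check result lines."""
    counts = Counter()
    for line in lines:
        # Assumption: blank lines and lines starting with '#' carry no result.
        if not line.strip() or line.startswith("#"):
            continue
        if "reuse=yes" in line:
            counts["yes"] += 1
        elif "reuse=no" in line:
            counts["no"] += 1
        else:
            counts["unknown"] += 1
    return counts

def reusable_fraction(counts):
    """Fraction of flagged messages whose net results were reused."""
    total = counts["yes"] + counts["no"]
    return counts["yes"] / total if total else 0.0

# Hypothetical sample lines, for illustration only.
sample = [
    "# mass-check results",
    "Y 12 /corpus/spam/1 RULE_A,RULE_B time=1192700000,reuse=yes",
    ". 0 /corpus/ham/2 RULE_C time=1192700001,reuse=no",
    "Y 9 /corpus/spam/3 RULE_A time=1192700002,reuse=yes",
]
counts = reuse_counts(sample)
print(counts["yes"], counts["no"], reusable_fraction(counts))
```

Running this over each contributor's log would show how much of the corpus a
reuse-only set1 run could actually draw on.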

> >> If you're running with set0 only your detection 
> >> rate already sucks, and if you're running with set1 you'll only get the 
> >> new rules once a week.
> > 
> > Can we not just assume that it's safe to copy the set0 scores for
> > the rest of the week?
> 
> I don't believe that it is safe.  Often the set1 scores are a *lot* 
> lower than the set0 scores.  The set0 scores are weighted a lot more 
> heavily (by the GA) to move the spam TP rate from 46% to 80% (seriously, 
> check out the scores/stats-set0 file) while set1 only moves from 88% to 96%.
> 
> If we had to just use the set0 scores I don't think I'd be comfortable 
> with an adjustment factor of more than 25% (that is, the set1 scores 
> would only be a quarter of the set0 scores).

wow.  those are big differences :(
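
To make the 25% figure concrete, here is a minimal sketch of that fallback:
stand-in set1 scores derived by multiplying each set0 score by a fixed factor.
The rule names, the score values, and the function name are made up for
illustration; this is the copy-and-scale fallback, not the GA.

```python
def scale_scores(set0_scores, factor=0.25):
    """Derive stand-in set1 scores by scaling each set0 score by `factor`."""
    return {rule: round(score * factor, 3)
            for rule, score in set0_scores.items()}

# Hypothetical set0 scores.
set0 = {"RULE_A": 2.0, "RULE_B": 3.6, "RULE_C": -1.2}
set1 = scale_scores(set0)
print(set1)
```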

ok, if we can get the --reuse-only trick working, I think that'll
work fine -- allowing nightly set1 mass-checks without taking forever.
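
A sketch of how one log could feed both scoresets, under stated assumptions:
every message contributes to set0 with its net rule hits dropped, and only
messages flagged "reuse=yes" contribute to set1. The five-field line layout,
the NET_RULES set, and the function names are hypothetical and are not
mass-check's actual interface.

```python
def parse_line(line):
    """Split a result line into (class, path, rules, metadata).

    Assumed layout (hypothetical): class, score, path, comma-separated rule
    names, comma-separated key=value metadata, separated by whitespace.
    Lines with an empty rule list are not handled by this sketch.
    """
    cls, _score, path, rules, meta = line.split(None, 4)
    meta_map = dict(item.split("=", 1)
                    for item in meta.strip().split(",") if "=" in item)
    return cls, path, set(rules.split(",")), meta_map

def split_scoresets(lines, net_rules):
    """Build set0 (all messages, net hits removed) and set1 (reused only)."""
    set0, set1 = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cls, path, rules, meta = parse_line(line)
        set0.append((cls, path, rules - net_rules))
        if meta.get("reuse") == "yes":
            set1.append((cls, path, rules))
    return set0, set1

# Hypothetical net rule names and log lines.
NET_RULES = {"NET_RULE_X"}
log = [
    "Y 12 /corpus/spam/1 RULE_A,NET_RULE_X time=1192700000,reuse=yes",
    ". 0 /corpus/ham/2 RULE_C time=1192700001,reuse=no",
]
set0, set1 = split_scoresets(log, NET_RULES)
print(len(set0), len(set1))
```

The non-reused message still counts toward set0, so the set0 corpus stays at
full size while set1 only sees results that did not need fresh net lookups.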

> >> Additionally, I think we should re-use bayes results so we can more 
> >> accurately generate scores for set2 and 3.  Otherwise I think I'm going 
> >> to just copy them over from sets0 and 1 and lower them with some random 
> >> adjustment factor.
> > 
> > Either of those options makes sense to me.
> > 
> > I think we need to come up with some kind of extrapolation algorithm for
> > these, to be honest; I don't think 4 mass-checks are at all possible. :(
> 
> The only reason we would need 4 mass-checks is if there are meta rules 
> that fire in the non-net or non-bayes scoresets that won't fire if a net 
> or bayes rule does fire.  I'm not aware of any such rules, but it's 
> possible for it to happen (although I'd rather just let the GA decide 
> whether or not the rule should be used by the net or bayes scoreset 
> rather than the meta rule).  Otherwise, we can extract everything we 
> need from a single mass-check.

yeah, I'm not worried about those cases.

--j.
