Re: 3.0.5 rescoring

Henry Stern Thu, 01 Dec 2005 15:01:23 -0800

I'd expect that the 700k-message corpus will be more prone to errors
than the 2M-message corpus.  It still might be good enough.


I'm not convinced that rescoring (as opposed to putting in new rules)
will do much for 3.0.5's accuracy.  If people really want to go to the
trouble of running the mass-checks, I won't say no to generating the
scores.  However, I can't promise that they will be any good.

Cheers,
Henry

Justin Mason wrote:
> Actually, the problem that Theo is highlighting is not that we don't have
> any contributors for rescoring mass-checks using smaller corpora; we do
> (and more are definitely welcome!)
>
> The problem is that these small corpora become "background noise" compared
> to the big, 700k-message corpora -- myself (34%), and Theo (31%).
> What we need to do to fix this problem is come up with ways to avoid
> letting big corpora "drown out" the little ones.
>
> I think if we limit each corpus to a certain max percentage of the total,
> we could do this -- e.g. if a corpus makes up more than (100 /
> num_contributors)%, then any excess above that percentage is dropped,
> favouring recent mails over older ones.  (This post-processing step
> is doable with mass-check logs btw, we can write a script to do this.)
>
> The downside would be that we would then have "only" a 700,000-message
> corpus (or so) instead of a 2,000,000-message one.  Henry, is that OK?
>
> --j.
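
For illustration, the capping step Justin describes could be sketched
roughly as below.  This is only one reading of the proposal, not an
existing SpamAssassin script: the function name cap_corpora and the
(contributor, timestamp, line) tuple layout are hypothetical, the real
mass-check log format is not parsed here, and the cap is computed once
against the original total rather than recomputed after dropping.

```python
from collections import defaultdict


def cap_corpora(messages):
    """Limit each contributor to at most (100 / num_contributors)% of the total.

    `messages` is an iterable of (contributor, timestamp, line) tuples.
    This layout is an assumption for the sketch, not the mass-check format.
    Excess messages are dropped oldest-first, so recent mail is favoured.
    """
    by_contributor = defaultdict(list)
    for msg in messages:
        by_contributor[msg[0]].append(msg)

    if not by_contributor:
        return []

    total = sum(len(msgs) for msgs in by_contributor.values())
    # (100 / N)% of the total is total / N messages; round down.
    cap = total // len(by_contributor)

    kept = []
    for msgs in by_contributor.values():
        # Newest first, then keep at most `cap` messages per contributor.
        msgs.sort(key=lambda m: m[1], reverse=True)
        kept.extend(msgs[:cap])
    return kept
```

Corpora already under the cap pass through untouched, so only the
largest contributors lose mail, and the combined corpus shrinks
accordingly.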
