I'd expect that the 700k message corpus will be more prone to errors than the 2M message corpus. It still might be good enough.
I'm not convinced that rescoring (as opposed to putting in new rules) will do much for 3.0.5's accuracy. If people really want to go to the trouble of running the mass-checks, I won't say no to generating the scores. However, I can't promise that they will be any good.

Cheers,
Henry

Justin Mason wrote:
> Actually, the problem that Theo is highlighting is not that we don't have
> any contributors for rescoring mass-checks using smaller corpora; we do
> (and more are definitely welcome!)
>
> The problem is that these small corpora become "background noise" compared
> to the big, 700k-message corpora -- myself (34%), and Theo (31%).
> What we need to do to fix this problem, is come up with ways to avoid
> letting big corpora "drown out" the little ones.
>
> I think if we limit each corpora to a certain max percentage of the total,
> we could do this -- e.g. if a corpus makes up more than (100 /
> num_contributors)%, then any excess above that percentage is dropped,
> favouring recent mails over older ones. (This post-processing step
> is doable with mass-check logs btw, we can write a script to do this.)
>
> The downside would be that we would then have "only" a 700,000-message
> corpus (or so) instead of a 2,000,000-message one. Henry, is that OK?
>
> --j.
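
For illustration only, the capping step proposed in the quoted message could be sketched roughly as follows. This is not an existing SpamAssassin script. It assumes the mass-check logs have already been parsed into hypothetical (contributor, timestamp, message_id) tuples, and it reads "(100 / num_contributors)% of the total" as a share of the original, uncapped total, applied in a single pass. The name cap_corpora is made up for this sketch.

```python
from collections import defaultdict


def cap_corpora(entries):
    """Trim each contributor's corpus to at most total/num_contributors messages.

    `entries` is an iterable of (contributor, timestamp, message_id) tuples.
    This tuple layout is an assumption for the sketch, not the real
    mass-check log format. When a corpus exceeds the cap, the oldest
    messages are dropped first, so recent mail is favoured.
    """
    by_contributor = defaultdict(list)
    for contributor, timestamp, message_id in entries:
        by_contributor[contributor].append((timestamp, message_id))

    if not by_contributor:
        return []

    total = sum(len(msgs) for msgs in by_contributor.values())
    # Cap is measured against the ORIGINAL total (assumption, see lead-in).
    cap = total // len(by_contributor)

    kept = []
    for contributor, msgs in by_contributor.items():
        msgs.sort(reverse=True)  # newest first
        for timestamp, message_id in msgs[:cap]:
            kept.append((contributor, timestamp, message_id))
    return kept


if __name__ == "__main__":
    # Toy data: one large corpus and two small ones.
    sample = [("alice", t, f"a{t}") for t in range(1, 9)]
    sample += [("bob", t, f"b{t}") for t in range(1, 3)]
    sample += [("carol", t, f"c{t}") for t in range(1, 3)]
    for row in cap_corpora(sample):
        print(row)
```

Because the cap is computed once against the uncapped total, a trimmed corpus can still exceed 1/num_contributors of the smaller resulting set. Repeating the pass until nothing changes would be a stricter reading of the proposal.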
