On Mon, 01 Aug 2016 16:49:15 +0100, John Hardin <[email protected]> wrote:
My fear is that we start the scoring when we receive a small corpus (20k
ham) that just barely meets the threshold and then ignore a large one
(100k ham) that arrives shortly thereafter and would greatly improve the
results.
Perhaps: if we receive a delayed corpus that crosses the threshold, we
don't *immediately* start scoring; instead we start in half an hour -
this gives another corpus a chance to come in. This would continue
up to some maximum (1h?).
Or perhaps I'm overthinking it. :)
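The delayed-start idea above could be sketched roughly like this, as a debounce with a cap. A minimal sketch, assuming the 20k ham threshold and the 30-minute/1-hour windows mentioned in the thread; the class and method names are hypothetical, not part of any existing tooling:

```python
from datetime import datetime, timedelta

HAM_THRESHOLD = 20_000          # assumed minimum ham count before scoring may start
GRACE = timedelta(minutes=30)   # wait this long after each late corpus arrives
MAX_WAIT = timedelta(hours=1)   # never delay more than this past the first crossing


class ScoringScheduler:
    """Debounce the start of a scoring run: each corpus arriving after
    the threshold is crossed pushes the start back by GRACE, but never
    past MAX_WAIT from the moment the threshold was first met."""

    def __init__(self):
        self.ham_count = 0
        self.first_crossing = None   # when the threshold was first met
        self.start_at = None         # earliest time scoring may begin

    def corpus_received(self, ham_msgs: int, now: datetime):
        self.ham_count += ham_msgs
        if self.ham_count < HAM_THRESHOLD:
            return
        if self.first_crossing is None:
            self.first_crossing = now
        # Push the start back by the grace period, but respect the cap.
        self.start_at = min(now + GRACE, self.first_crossing + MAX_WAIT)

    def should_start(self, now: datetime) -> bool:
        return self.start_at is not None and now >= self.start_at
```

With those numbers, a corpus landing 45 minutes after the threshold was crossed would only extend the wait to the 1-hour cap, not to 75 minutes.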
Looking at the latest net run, there were just over 160k ham. Only three
corpora go over 20k ham, and the largest is just under 80k ham. The largest
single corpus that could come in after the threshold is reached is just over
6k. I think your ideas of big and small are optimistic, and if that really
happened we may not have anything to worry about.
The odd thing is that a lot of the smaller corpora probably add some of the
most useful variety. I seem to recall someone from Norway was recently
looking to get involved, for example? A couple of thousand ham from there
may make more impact than an extra couple of thousand from the same
sources. For that net run we had 8 people uploading 11 sets of data - a
bit more breadth wouldn't hurt. At the moment that's the bigger problem
really, and I have even less idea how to help there.
Something that Jari touches on is that there's not really any info on when
we need to submit by. I remember seeing something that said to start as
soon after 9am GMT as possible, but I don't recall a deadline for when it
has to be in. I know mine is usually uploaded by 1pm GMT and have always
assumed that was early enough, but tbh nobody has ever said anything
about it. The only feedback I ever got was when I started too early. If
I'm late I can look at faster options, but until someone tells me otherwise
I assume I'm getting mine in on time.
If I knew when the deadline was, or why it was chosen, I might have an
opinion on factoring in extra delays. Capping that additional window makes
sense, since at some point the bullet needs biting to get something out the
door, but I'd still be disinclined. It potentially waits an hour even if
that was the last upload that could happen. What if that then pushes it past
the daily cutoff? Or should we only allow that extension before a certain
point in the day to avoid that problem? It's not that we don't want the
additional data - the more the merrier - just that it seems to
require a lot of extra factors to work as well as it should. Now I'm
overthinking it!
I dislike the idea of trying to calculate a hard start cutoff based on
how long the scoring run takes. Do we really want to maintain statistics
on that?
Probably not. Again, maybe if/when things get busier it may prove more
worthwhile, but at the moment it's likely poor reward for the effort.
OK, so the hard starting cutoff could be the time the following pass
does its SVN get. If the scoring is underway at that point, we let it
run to completion? I am making an assumption here: that the time the
scoring and rule generation takes is less than the get -> minimum
scoring start delay, so that the scoring+rulegen passes won't overlap.
It seems simple and reasonable to me.
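That non-overlap assumption can be stated as a quick sanity check. All the timings below are made up for illustration - none of these numbers come from the thread, and the real values would have to be measured:

```python
from datetime import datetime, timedelta

# Hypothetical timings for one daily pass (illustrative only).
SVN_GET = datetime(2016, 8, 1, 9, 0)       # when a pass does its SVN get
MIN_SCORING_DELAY = timedelta(hours=4)     # get -> earliest scoring start
SCORING_AND_RULEGEN = timedelta(hours=3)   # how long scoring + rulegen take
PASS_INTERVAL = timedelta(hours=24)        # one pass per day

# The proposed hard cutoff for starting this pass's scoring is the
# moment the *next* pass does its SVN get.
hard_cutoff = SVN_GET + PASS_INTERVAL

# A run that starts at the last possible moment finishes at:
latest_finish = hard_cutoff + SCORING_AND_RULEGEN

# The next pass's scoring can begin no earlier than:
next_earliest_start = hard_cutoff + MIN_SCORING_DELAY

# The assumption in the mail: scoring+rulegen fits inside the
# get -> minimum-scoring-start delay, so even a run that squeaks in
# just before the cutoff completes before the next pass could score.
assert SCORING_AND_RULEGEN < MIN_SCORING_DELAY
assert latest_finish < next_earliest_start
```

If the measured run time ever approached the get-to-start delay, the two passes could overlap and the simple "let it run to completion" rule would need revisiting.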