On Mon, 01 Aug 2016 16:49:15 +0100, John Hardin <[email protected]> wrote:
My fear is that we start the scoring when we receive a small corpus (20k
ham) that just barely meets the threshold and then ignore a large one
(100k ham) that arrives shortly thereafter and would greatly improve the
results.
Perhaps: if we receive a delayed corpus that crosses the threshold, we
don't *immediately* start scoring; instead we start in half an hour -
this gives another corpus a chance to come in. This would continue
up to some maximum (1h?).
Or perhaps I'm overthinking it. :)
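The delayed-start idea above could be sketched roughly like this, as a debounce with a cap. A minimal sketch, assuming the 20k ham threshold and the 30-minute/1-hour windows mentioned in the thread; the class and method names are hypothetical, not part of any existing tooling:

```python
from datetime import datetime, timedelta

HAM_THRESHOLD = 20_000          # assumed minimum ham count before scoring may start
GRACE = timedelta(minutes=30)   # wait this long after each late corpus arrives
MAX_WAIT = timedelta(hours=1)   # never delay more than this past the first crossing


class ScoringScheduler:
    """Debounce the start of a scoring run: each corpus arriving after
    the threshold is crossed pushes the start back by GRACE, but never
    past MAX_WAIT from the moment the threshold was first met."""

    def __init__(self):
        self.ham_count = 0
        self.first_crossing = None   # when the threshold was first met
        self.start_at = None         # earliest time scoring may begin

    def corpus_received(self, ham_msgs: int, now: datetime):
        self.ham_count += ham_msgs
        if self.ham_count < HAM_THRESHOLD:
            return
        if self.first_crossing is None:
            self.first_crossing = now
        # Push the start back by the grace period, but respect the cap.
        self.start_at = min(now + GRACE, self.first_crossing + MAX_WAIT)

    def should_start(self, now: datetime) -> bool:
        return self.start_at is not None and now >= self.start_at
```

With those numbers, a corpus landing 45 minutes after the threshold was crossed would only extend the wait to the 1-hour cap, not to 75 minutes.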
Looking at the latest net run, there were just over 160k ham. Only three
corpora go over 20k ham, and the largest is just under 80k ham. The largest
single corpus that could come in after the threshold is reached is just over
6k. I think your ideas of big and small are optimistic, and if that really
happened we may not have anything to worry about.
The odd thing is that a lot of the smaller corpora probably add some of the
most useful variety. I seem to recall someone from Norway was recently
looking to get involved, for example? A couple of thousand ham from there
may make more impact than an extra couple of thousand from the same
sources. For that net run we had 8 people uploading 11 sets of data - a
bit more breadth wouldn't hurt. At the moment that's the bigger problem
really, and I have even less idea how to help there.
Something that Jari touches on is that there's not really any info on when
we need to submit by. I remember seeing something that said to start as
soon after 9am GMT as possible, but I don't recall a deadline for when it
has to be in. I know mine is usually uploaded by 1pm GMT and have always
assumed that was early enough, but tbh nobody has ever said anything
about it. The only feedback I ever got was when I started too early. If
I'm late I can look at faster options, but until someone tells me otherwise
I assume I'm getting mine in on time.
If I knew when the deadline was, or why it was chosen, I might have an
opinion on factoring in extra delays. Capping that additional window makes
sense, since at some point the bullet needs biting to get something out the
door, but I'd still be disinclined. It potentially waits an hour even if
that was the last upload that could happen. What if that then pushes it past
the daily cutoff? Or should we only allow that extension before a certain
point in the day to avoid that problem? It's not that we don't want the
additional data - the more the merrier - just that it seems to
require a lot of extra factors to work as well as it should. Now I'm
overthinking it!
I dislike the idea of trying to calculate a hard start cutoff based on
how long the scoring run takes. Do we really want to maintain statistics
on that?
Probably not. Again, maybe if/when things get busier it may prove more
worthwhile, but at the moment it's likely poor reward for the effort.
OK, so the hard starting cutoff could be the time the following pass
does its SVN get. If the scoring is underway at that point, we let it
run to completion? I am making an assumption here: that the time the
scoring and rule generation takes is less than the get -> minimum
scoring start delay, so that the scoring+rulegen passes won't overlap.
It seems simple and reasonable to me.
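That non-overlap assumption can be stated as a quick sanity check. All the timings below are made up for illustration - none of these numbers come from the thread, and the real values would have to be measured:

```python
from datetime import datetime, timedelta

# Hypothetical timings for one daily pass (illustrative only).
SVN_GET = datetime(2016, 8, 1, 9, 0)       # when a pass does its SVN get
MIN_SCORING_DELAY = timedelta(hours=4)     # get -> earliest scoring start
SCORING_AND_RULEGEN = timedelta(hours=3)   # how long scoring + rulegen take
PASS_INTERVAL = timedelta(hours=24)        # one pass per day

# The proposed hard cutoff for starting this pass's scoring is the
# moment the *next* pass does its SVN get.
hard_cutoff = SVN_GET + PASS_INTERVAL

# A run that starts at the last possible moment finishes at:
latest_finish = hard_cutoff + SCORING_AND_RULEGEN

# The next pass's scoring can begin no earlier than:
next_earliest_start = hard_cutoff + MIN_SCORING_DELAY

# The assumption in the mail: scoring+rulegen fits inside the
# get -> minimum-scoring-start delay, so even a run that squeaks in
# just before the cutoff completes before the next pass could score.
assert SCORING_AND_RULEGEN < MIN_SCORING_DELAY
assert latest_finish < next_earliest_start
```

If the measured run time ever approached the get-to-start delay, the two passes could overlap and the simple "let it run to completion" rule would need revisiting.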