On Mon, 1 Aug 2016, Kevin Golding wrote:
On Sun, 31 Jul 2016 22:00:11 +0100, John Hardin <[email protected]> wrote:
This can use some refinement:
Some good thoughts, but ones that I fear may prove an obstacle to getting a
change in place. Perhaps things for a wishlist instead?
Maybe.
If we've started scoring and another result set for that pass comes in, do
we incorporate that into the score generation? We probably should; the
decision could be based on when the delayed results come in (we don't want
to keep resetting the scoring process and collide with the following pass)
and how large the new results are (we might want to ignore a late small
result set, but incorporate a late large result set).
As it stands I'm inclined to take the route that anything submitted after the
run has started gets lost. This is no different from the current situation (as
I understand it, anyway), so it's not penalising anyone, but it also doesn't
grant further concessions. Adding in new results just seems like a way to
potentially further delay an already delayed process.
I'm hoping to balance delay and quality of results.
Much as the additional data is beneficial, it seems added complexity for no
gain. Given how tight the ham threshold is most days (there are a lot of days
in the 140k-150k region), a large result set is unlikely to arrive after the
threshold has been met anyway; it's far more likely to be the trigger. If we
start dividing large and small we need to pick a point and draw a line, and
potentially discourage submissions from people who feel they aren't important
enough.
I'd also note that when you look at the uploads you have people like axb who
submit multiple times in small groups - that is always an option to people if
they feel something is important enough to beat the threshold.
My fear is that we start the scoring when we receive a small (20k ham
corpus) that just barely meets the threshold and then ignore a large (100k
ham corpus) that is received shortly thereafter and that would greatly
improve the results.
Perhaps: if we receive a delayed corpus that crosses the threshold, we
don't *immediately* start scoring, instead we start in half an hour - this
gives a chance for another corpus to come in. This would continue up to
some maximum (1h?).
Or perhaps I'm overthinking it. :)
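For what it's worth, the grace-period idea could be sketched roughly like
this (purely illustrative - the function name and the 30min/1h timings are
just the numbers floated above, not anything in the actual masscheck
tooling):

```python
GRACE = 30 * 60        # wait 30 minutes after the threshold is crossed
MAX_DELAY = 60 * 60    # but never hold the start for more than 1 hour total

def scoring_start_time(threshold_met_at, late_arrivals):
    """Return the time (seconds) at which scoring should start.

    threshold_met_at: when the submitted corpora first crossed the threshold.
    late_arrivals:    arrival times of any corpora received after that point.
    """
    start = threshold_met_at + GRACE
    for t in late_arrivals:
        # each late corpus pushes the start back by another grace period...
        start = max(start, t + GRACE)
    # ...but never past the hard maximum delay after the threshold was met
    return min(start, threshold_met_at + MAX_DELAY)
```

So a lone 20k corpus that barely crosses the threshold buys a 30-minute
window for that hypothetical 100k corpus to arrive, and a steady trickle of
late submissions can stretch the wait to at most an hour.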
If we're still running a score generation for pass X and pass X+1 has
reached its cutoff and has enough corpora to satisfy the thresholds and
immediately start the scoring process, do we give up on processing pass X?
I would think yes.
I don't know how long the process takes, but if we never start a pass once
the next day's start point has arrived, I would assume the runs would never
overlap.
I dislike the idea of trying to calculate a hard start cutoff based on how
long the scoring run takes. Do we really want to maintain statistics on
that?
I could be wrong, but it seems likely that a hard cutoff that won't overlap
the next day's start may be simpler. At some point we need to give up hope on
a day's results anyway, so that may be the guideline for when that time is.
OK, so the hard starting cutoff could be the time the following pass does
its SVN get. If the scoring is underway at that point, we let it run to
completion? I am making an assumption here: that the time the scoring and
rule generation takes is less than the get -> minimum-scoring-start delay,
so that the scoring+rulegen passes won't overlap.
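Put another way, the decision rule we seem to be converging on might look
like this (a sketch only - the function and its arguments are hypothetical,
and rest on the assumption above that a running pass always finishes before
the next one can start):

```python
def should_abandon_pass(now, next_pass_svn_get, scoring_started):
    """Give up on a scoring pass only if it has NOT yet started by the
    time the following pass does its SVN get; a run already underway is
    allowed to finish.

    now:               current time (seconds)
    next_pass_svn_get: time of the following pass's SVN get
    scoring_started:   whether this pass's scoring run is already underway
    """
    return now >= next_pass_svn_get and not scoring_started
```

That is: the SVN get of pass X+1 acts as the hard starting cutoff for
pass X, but never kills a run in progress.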
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
[email protected] FALaholic #11174 pgpk -a [email protected]
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
It is not the place of government to make right every tragedy and
woe that befalls every resident of the nation.
-----------------------------------------------------------------------
4 days until the 281st anniversary of John Peter Zenger's acquittal