On Mon, 1 Aug 2016, Kevin Golding wrote:

On Sun, 31 Jul 2016 22:00:11 +0100, John Hardin <[email protected]> wrote:

This can use some refinement:

Some good thoughts, but ones that I fear may prove an obstacle to getting a change in place. Perhaps things for a wishlist instead?

Maybe.

If we've started scoring and another result set for that pass comes in, do we incorporate it into the score generation? We probably should; the decision could be based on when the delayed results come in (we don't want to keep resetting the scoring process and collide with the following pass) and on how large the new results are (we might want to ignore a late small result set, but incorporate a late large one).

As it stands I'm inclined to take the route that anything submitted after the run has started gets lost. This is no different from the current situation (as I understand it, anyway), so it's not penalising anyone, but it also doesn't grant further concessions. Adding in new results just seems a way to further delay an already-delayed process.

I'm hoping to balance delay and quality of results.

Much as the additional data is beneficial, it seems like added complexity for no gain. Given how tight the ham threshold is most days (there are a lot of days in the 140k-150k region), a large result set is unlikely to arrive after the threshold has been met anyway; it's far more likely to be the trigger. If we start dividing large from small we need to pick a point and draw a line, and potentially discourage submissions from people who feel they aren't important enough.

I'd also note that when you look at the uploads you have people like axb who submit multiple times in small groups - that option is always open to anyone who feels something is important enough to beat the threshold.

My fear is that we start the scoring when we receive a small corpus (20k ham) that just barely meets the threshold, and then ignore a large corpus (100k ham) that is received shortly thereafter and would greatly improve the results.

Perhaps: if we receive a delayed corpus that crosses the threshold, we don't *immediately* start scoring; instead we start in half an hour, which gives another corpus a chance to come in. Each late arrival could extend the delay, up to some maximum (1h?).

Or perhaps I'm overthinking it. :)
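To make the idea concrete, here's a minimal sketch of that grace-period trigger. Everything here is illustrative: the class name, the 150k ham threshold, and the 30-minute/1-hour values are placeholders, not anything in the actual masscheck tooling.

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=30)   # extension granted when a corpus arrives
MAX_DELAY = timedelta(hours=1)  # never wait longer than this past the threshold

class ScoringTrigger:
    """Track corpus submissions and decide when a scoring run may start."""

    def __init__(self, ham_threshold):
        self.ham_threshold = ham_threshold
        self.ham_total = 0
        self.threshold_met_at = None  # when the threshold was first crossed
        self.start_at = None          # earliest permitted start time

    def submit(self, ham_count, now):
        self.ham_total += ham_count
        if self.ham_total >= self.ham_threshold:
            if self.threshold_met_at is None:
                self.threshold_met_at = now
            # each submission pushes the start back by GRACE,
            # capped at MAX_DELAY past the threshold crossing
            self.start_at = min(now + GRACE,
                                self.threshold_met_at + MAX_DELAY)

    def should_start(self, now):
        return self.start_at is not None and now >= self.start_at
```

So a corpus that barely crosses the threshold buys a 30-minute window for a bigger one to land, and a steady trickle of late corpora can't stall the run past the one-hour cap.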

If we're still running a score generation for pass X when pass X+1 has reached its cutoff, has enough corpora to satisfy the thresholds, and could immediately start its own scoring process, do we give up on processing pass X? I would think yes.

I don't know how long the process takes, but if we never start a pass after the next day's start point arrives, I would assume the runs would never overlap.

I dislike the idea of trying to calculate a hard start cutoff based on how long the scoring run takes. Do we really want to maintain statistics on that?

I could be wrong, but it seems likely that a hard cutoff that avoids overlapping the next day's start would be simpler. At some point we need to give up hope on a day's results anyway, so that may be the guideline for when that time is.

OK, so the hard starting cutoff could be the time the following pass does its SVN get. If the scoring is underway at that point, we let it run to completion? I am making an assumption here: that the time the scoring and rule generation take is less than the get -> minimum scoring start delay, so that the scoring+rulegen passes won't overlap.
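Putting the give-up rule together, the decision for pass X might look like the sketch below. This is purely illustrative; the function name, state flags, and the idea of passing the next pass's SVN-get time in as the cutoff are my assumptions, not the actual infrastructure.

```python
from datetime import datetime

def next_action(now, next_svn_get, thresholds_met, scoring_underway):
    """Decide what pass X should do, using the following pass's SVN-get
    time as a hard starting cutoff.

    - A run already in flight is allowed to finish (this assumes the run
      is shorter than the get -> minimum-start delay, so it can't
      overlap the next pass's scoring).
    - A run that hasn't started by the cutoff is abandoned.
    """
    if scoring_underway:
        return "run to completion"
    if now >= next_svn_get:
        return "abandon pass"      # the following pass takes over
    if thresholds_met:
        return "start scoring"
    return "wait for corpora"
```

The appeal of this shape is that nothing needs to track how long scoring historically takes: the only clock that matters is the next pass's checkout time.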


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [email protected]    FALaholic #11174     pgpk -a [email protected]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  It is not the place of government to make right every tragedy and
  woe that befalls every resident of the nation.
-----------------------------------------------------------------------
 4 days until the 281st anniversary of John Peter Zenger's acquittal
