Richard,

Certainly the idea makes some assumptions that aren't entirely valid, 
and your comments about the SETI -9's and associated fast work units are 
a good example.

I don't really see a way around that.

It may be that the current algorithm is pretty good as it stands, needing 
only some minor changes:

1) Raise the 100/day limit dramatically -- settable by project.

2) If a work unit errors, reduce the daily quota per the current 
algorithm, then check it against some limit, like 100: if the reduced 
quota is still more than 100, cap it at 100.

3) If the quota is less than some reasonable number (maybe 100 works 
here as well) and a good work unit is received, raise the quota as now; 
if the result is still less than 100, raise it to 100.

Seems that the only thing we really want to do is keep a broken host 
from loading up on work when all it really does is cause reissues.  The 
worst case is a machine running BOINC that isn't being monitored at all. 
If it takes weeks to slow it down, that doesn't really hurt, does it?

In the grand scheme of things, if it takes a while for the quota limit 
to drop, that's okay, because the reissues aren't lost work.

-- Lynn

On 5/24/2010 2:01 PM, Richard Haselgrove wrote:
> I've never noticed point (2) to be a problem. Successful returns increase the
> quota for the *current* day, not for subsequent days: so, in the worst case
> scenario (unless the doubling algorithm has been changed with the new server
> code):
>
> The user only notices the problem after quota has already dropped to 1 per
> day, and far enough into the day that the day's single job has already been
> wasted. In this case, no new work can be fetched until after the server's
> midnight, even to test whether the user's fix has been successful. That can
> be the most frustrating part.
>
> After midnight, one new task can be fetched - per CPU core, in the current
> model. How does that scale with GPUs? I think it should continue to scale so
> that every machine resource can be supplied with the single daily 'test WU':
> if the host has four GPUs, it should be allowed four GPU-tasks.
>
> No further work is allowed until the first task has been reported as
> 'success'. Time is wasted if there's a long file upload to complete, but I
> don't think that can be avoided. But after the first task has reported,
> another task is immediately permitted to download. If the second task is
> also successful, quota becomes four: two have been used, so two can be
> downloaded - one to run immediately, one in reserve to start when #3
> finishes. From that point forward, you're ahead of the game - no more time
> is wasted, and the doubling soon restores full service. On a multicore, it's
> even quicker: with four cores, by the end of the first test set of four
> (which always seem to finish at slightly different times), quota has doubled
> four times to 16 per core, or 64 for the computer as a whole: four
> completed, four in progress, already means 56 available and ready to run.
> That's plenty.
>
> I agree with Lynn's point (often suggested) that the starting point for
> newly-attached hosts should be set much lower than the ultimate limit
> achievable by reliable hosts. Set a low starting 'probationary' quota - even
> as low as two per core, for "one running and one spare to follow" - and allow
> it to double as now.
>
> The maximum value that quota can reach should be determined by each project.
> 100 is already too low for SETI GPUs, but ludicrously high for CPDN. I think
> we've already covered the point that it should be variable by application
> (AQUA FP takes an hour, so needs a quota of at least 50/day; AQUA IQ takes
> several days, so a quota of 2/day is plenty). And that's without considering
> CPU/GPU
> versions of the same app.
>
> Allowing a 'bonus' on quota for a validated task gets round the astronomical
> numbers that can be processed by "successful, but idiotic" reports such as
> SETI overflows on faulty GPUs. But it suffers from the asynchronous nature
> of validation: why should I deserve a bonus task today, if my oldest pending
> task - returned 8 February - happens to be validated by its fifth potential
> wingmate?
>
>
>> The BOINC scheduler has a mechanism called "host punishment"
>> designed to deal with hosts that request an infinite sequence of jobs,
>> and either error out on them or never return them.
>>
>> It works like this: there's a project parameter called "daily result
>> quota",
>> say 100.  Every host starts off with this quota.
>> If it returns an error or times out, the quota is decremented down to,
>> but not below, 1.  If it returns a valid result, the quota is doubled.
>> The idea is that faulty hosts are given 1 job per day
>> to see if they've been fixed.
>>
>> Recently this mechanism was changed from per-project to per-app-version,
>> the idea being that a host might be erroring out on a GPU version
>> but not the CPU version.
>>
>> However, the basic mechanism is somewhat flawed:
>>
>> 1) What if a fast host can do more than 100 jobs a day?
>> We could increase the default quota, but that would let bad hosts
>> trash that many more jobs.
>>
>> 2) It takes too long for a fixed host to ramp up its quota.
>>
>> The bottom line: as long as a host is sending correct results,
>> it shouldn't have a daily quota at all.
>>
>> ---------
>>
>> If anyone has ideas for how to change the host punishment mechanism,
>> please let me know.
>> I'll think about it and post a proposal at some point.
>>
>> -- David
>> _______________________________________________
>> boinc_dev mailing list
>> [email protected]
>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> To unsubscribe, visit the above URL and
>> (near bottom of page) enter your email address.
>>