I'm afraid this isn't going to work. (Please excuse us while we, yet again, discuss a SETI-specific problem)
There's a frequently observed failure mode for the SETI CUDA application whose cause is unknown. Once a CUDA card gets into this state, nothing seems to stop it except a host reboot, i.e. manual intervention. The symptoms are that the affected CUDA card exits every task as an immediate -9 overflow, generates the matching maximum-size upload file, and reports the task as "success".

At a conservative average of one task per minute (actually it's quicker than that), the host could recycle roughly 1,500 tasks per day. The current local DCF will be falling, so more tasks will be requested than returned: even an instantaneous cut-off would leave a large cache to be blown off.

The early tasks, at least, will have a pretty fast turnaround: it's likely that their quorum partner won't have returned yet. When the first wingmate does come back (possibly not for several days), validation will be inconclusive: a third quorum member will be generated, queued, issued, cached and eventually returned. That's the *first* moment at which the pseudo -9 can be declared invalid.

If we wait for validation failures before we start to punish hosts with this class of failure, it'll be far too slow. The low basic quota, with replacement for successful validations, seems likely to be a better protection against runaway hosts.

> The goal is to not feed work to "broken" CPUs or GPUs.
>
> What about this:
>
> Each day, keep track of validations. If a work unit validates, and no
> invalid work is received, raise quota. If no work is validated that
> day, quota stays unchanged. If no valid work is received, and invalid
> work is received, leave the quota unchanged.
>
> I think that mixed results for a day (some valid, some invalid) should
> leave the quota unchanged.
>
> That would keep a long string of SETI -9's from killing the quota --
> it'd take days of sustained badness to stop a broken host.
>
> -- Lynn
>
> On 5/24/2010 2:55 PM, Richard Haselgrove wrote:
>>>> Allowing a 'bonus' on quota for a validated task gets round the
>>>> astronomical numbers that can be processed by "successful, but
>>>> idiotic" reports such as SETI overflows on faulty GPUs.
>>>
>>> Such results will be returned, but NOT validated. So they don't
>>> receive "validation bonus".
>>
>> Exactly. That's why it gets round - i.e. solves or avoids - the
>> problem that could be caused by an inflated general quota figure.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
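
For anyone who wants to play with the numbers, here is a rough Python sketch of the two quota rules discussed in this thread. It is only an illustration: the names (HostQuota, base_quota, tasks_sent, lynn_daily_update), the baseline of 10 tasks per day and the step of 1 are all made up for the example, and none of it is taken from the BOINC scheduler code.

```python
from dataclasses import dataclass

# One task per minute, the "conservative average" for a runaway host.
TASKS_PER_DAY_AT_ONE_PER_MINUTE = 24 * 60  # 1440


@dataclass
class HostQuota:
    """Low basic quota, topped up by one task per successful validation.

    Both fields are hypothetical illustration values.
    """
    base_quota: int = 10  # assumed low daily baseline
    bonus: int = 0        # extra tasks earned through validated results

    def daily_limit(self) -> int:
        return self.base_quota + self.bonus

    def on_validated(self) -> None:
        # A validated result "replaces" the quota slot it used.
        self.bonus += 1


def tasks_sent(quota: HostQuota, wanted: int) -> int:
    """Tasks the server would hand out today, capped by the quota."""
    return min(wanted, quota.daily_limit())


def lynn_daily_update(quota: int, valid: int, invalid: int, step: int = 1) -> int:
    """Daily rule as quoted: raise only on a day with valid and no invalid work.

    Every other case (nothing validated, only invalid, mixed) leaves the
    quota unchanged, so as quoted this rule never lowers the quota.
    """
    if valid > 0 and invalid == 0:
        return quota + step
    return quota


if __name__ == "__main__":
    broken = HostQuota()
    # The overflow results are never validated, so the bonus stays at zero.
    print(tasks_sent(broken, TASKS_PER_DAY_AT_ONE_PER_MINUTE))
    print(lynn_daily_update(10, valid=0, invalid=50))
```

With these assumed values the broken host is held to its base quota per day however fast it burns through tasks, because the overflow results never earn a validation bonus. Under the daily rule the same host keeps whatever quota it already had.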
