BTW, for this particular case of the -9 overflow error, it may be worth having the SETI project count such a result as a "real" error in quota computations (at least if it comes from a GPU-assigned task).
A valid -9 overflow can still come from a genuinely noisy task, but with the newly implemented radar blanking the number of such noisy tasks should diminish. Also, if a host returns both "regular" results and overflows, that should not inhibit its work fetches completely. But a broken CUDA host can be a real disaster, since it reports not computational errors but just invalid tasks, and it should indeed be stopped ASAP.

----- Original Message -----
From: "Richard Haselgrove" <[email protected]>
To: "Lynn W. Taylor" <[email protected]>; <[email protected]>
Sent: Tuesday, May 25, 2010 2:57 AM
Subject: Re: [boinc_dev] host punishment mechanism revisited

> I'm afraid this isn't going to work.
>
> (Please excuse us while we, yet again, discuss a SETI-specific problem)
>
> There's a frequently observed, but cause unknown, failure mode for the
> SETI CUDA application.
>
> Once a CUDA card gets into this state, nothing seems to stop it except a
> host reboot - i.e. manual intervention.
>
> The symptoms of the failure are that the CUDA card in question exits every
> task as an immediate -9 overflow, generates the matching maximum-size
> upload file, and reports the task as "success". At a conservative average
> of one task per minute (actually it's quicker than that), the host could
> recycle 1,500 tasks per day.
>
> The current local DCF will be falling, so more tasks will be requested
> than returned: even an instantaneous cut-off would leave a large cache to
> be blown off.
>
> The early tasks, at least, will have a pretty fast turnround: it's likely
> that their quorum partner won't have returned yet. When the first wingmate
> does come back (possibly not for several days), validation will be
> inconclusive: a third quorum member will be generated, queued, issued,
> cached and eventually returned. That's the *first* moment at which the
> pseudo -9 can be declared invalid.
>
> If we wait for validation failures to start to punish hosts with this
> class of failure, it'll be far too slow. The low basic quota, with
> replacement for successful validations, seems likely to be a better
> protection against runaway hosts.
>
>
>> The goal is to not feed work to "broken" CPUs or GPUs.
>>
>> What about this:
>>
>> Each day, keep track of validations. If a work unit validates, and no
>> invalid work is received, raise quota. If no work is validated that
>> day, quota stays unchanged. If no valid work is received, and invalid
>> work is received, leave the quota unchanged.
>>
>> I think that mixed results for a day (some valid, some invalid) should
>> leave the quota unchanged.
>>
>> That would keep a long string of SETI -9's from killing the quota --
>> it'd take days of sustained badness to stop a broken host.
>>
>> -- Lynn
>>
>> On 5/24/2010 2:55 PM, Richard Haselgrove wrote:
>>>>> Allowing a 'bonus' on quota for a validated task gets round the
>>>>> astronomical numbers that can be processed by "successful, but
>>>>> idiotic" reports such as SETI overflows on faulty GPUs.
>>>>
>>>> Such results will be returned, but NOT validated. So they don't
>>>> receive "validation bonus".
>>>
>>> Exactly. That's why it gets round - i.e. solves or avoids - the problem
>>> that could be caused by an inflated general quota figure.
>>>
>>>
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected]
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
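For reference, the daily rule Lynn proposes in the quoted message can be sketched roughly as below. This is only a literal reading of that proposal, not BOINC scheduler code: the function name `next_daily_quota`, the `increment` parameter and the sample numbers are all hypothetical. It also shows Richard's point in miniature: a host returning only pseudo -9 results never has its quota reduced under this rule.

```python
# Literal sketch of the proposed daily quota rule (hypothetical names and
# values; not taken from the BOINC scheduler).

TASKS_PER_DAY_AT_ONE_PER_MINUTE = 24 * 60  # 1440, i.e. roughly the 1,500 quoted


def next_daily_quota(quota, n_valid, n_invalid, increment=1):
    """Return tomorrow's quota from today's validation counts.

    - at least one valid result and no invalid ones: raise the quota
    - nothing validated, only invalid results, or a mix: leave it unchanged
    """
    if n_valid > 0 and n_invalid == 0:
        return quota + increment
    return quota


if __name__ == "__main__":
    quota = 10  # hypothetical starting quota
    # A healthy day: some results validated, none invalid.
    print(next_daily_quota(quota, n_valid=5, n_invalid=0))
    # A broken CUDA host: a full day of pseudo -9 results, all invalid.
    print(next_daily_quota(quota, n_valid=0,
                           n_invalid=TASKS_PER_DAY_AT_ONE_PER_MINUTE))
```

Under this reading the rule only ever raises the quota, so some separate mechanism (for example a low basic quota, or counting GPU -9 overflows as errors) would still be needed to actually stop a runaway host.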
