I don't see how "pollution" justifies this much work. Certainly the "pollution" from invalid work is less important than the pollution from a science application that selectively kills unfavorable work.
That said, the simplest solution may be to treat each device as a separate host. Different host ID, different queue, everything. A cruncher with three CUDA cards would get four host IDs.

On 3/28/2010 1:21 PM, Raistmer wrote:
> Unfortunately, binding quota to device type (instead of device instance)
> will not solve current issues with multi-GPU hosts.
> Such hosts (or hosts with multi-core GPU) can do correct computations on one
> GPU (GPU core) and incorrect (for example, constantly throwing -9 overflow
> in SETI project) ones on another.
> IMHO no need to implement full-scale scheduling algorithms (I suppose this
> thing you called modeling) per-device basis.
> All that would be needed is just additional field in structure that
> describes device.
> When work assigned to device BOINC knows to what particular device it
> assigns particular task. Then it could check (client, not server) outcome of
> this particular result (was computational error or not) and update
> corresponding field in structure for particular device.
> Sure, it can't catch invalid results, invalid status will be known only
> after validation, i.e. server should be involved.
> But such simplified mechanism could check computational (in particular,
> SETI's -9 overflow or CUDA-specific -1, not implemented) errors.
> Unfortunately, there are complications, overflow can be thrown for
> completely valid result too, but here rate of such errors could play some
> role...
> As bigger extension, BOINC client could attach additional field with device
> ID when reporting result to server.
> On next request server could tell client updated good/bad ratio for each
> device ID. Devices with poor good/bad ratios could be disabled for some
> period of time (something like device-wide backoff in computations). Here
> server-side changes required, but again, no need to do full-scale scheduling
> on per-device basis. Actually, scheduling should not be touched at all.
> BOINC client could just disable/enable corresponding devices according to
> device good/bad ratio (this would just decrease number of devices available
> for scheduling, AFAIK BOINC currently should deal with same situation. For
> example, number of available devices changes when user starts "no-GPU" app).
>
> ----- Original Message -----
> From: "David Anderson"<[email protected]>
> To: "Raistmer"<[email protected]>
> Cc:<[email protected]>
> Sent: Sunday, March 28, 2010 11:23 PM
> Subject: Re: [boinc_dev] BOINC's Quota system needs change
>
>
>> The new system (see updated doc:
>> http://boinc.berkeley.edu/trac/wiki/CreditNew)
>> will have separate quotas and error rates per resource type
>> (CPU, NVIDIA, ATI).
>>
>> Maintaining these separately for each GPU would require
>> modeling multiple GPUs separately,
>> rather than as N instances of the same thing as is currently done.
>> This would be a sweeping change, and won't get done in the near term.
>>
>> -- David
>>
>> Raistmer wrote:
>>> If hosts' task quota computed in old way, host that does valid CPU
>>> computations but invalid GPU ones will pollute database and waste project
>>> resource indefinitely.
>>> GPU usually much faster than CPU so many invalid tasks can be returned
>>> per single valid one.
>>> Moreover, even if CPU/GPU quota separation will be introduced, there are
>>> still multi GPU hosts that can pollute database with even bigger rate
>>> doing correct computations on one GPU and invalid ones on another.
>>> Current quota system applicable only to single host-single device
>>> approach and apparently should be changed.
>>> Right now I have no good idea what replacement can be, but this question
>>> definitely deserves consideration.
>>>
>>> One possible solution could be to track good/bad results ratio per
>>> hardware device (not per host) and inhibit work fetch for whole host if
>>> one of its devices has too bad good/bad ratio.
>>> Or issue some instruction
>>> to BOINC client to block affected device from receiving work (it could be
>>> more graceful approach).
>>> More ideas?
>>>
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected]
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
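
For illustration only, the client-side bookkeeping described in Raistmer's quoted message (a good/bad counter per device instance, plus a device-wide backoff when the error rate gets too high) might be sketched roughly as below. This is Python rather than BOINC's own code, and every name in it (DeviceRecord, DeviceTracker, max_error_rate, min_results, backoff_seconds) is hypothetical; nothing here is taken from the BOINC client. The minimum sample size is an assumption added because, as noted in the thread, an overflow can also occur on a perfectly valid result, so only the rate of errors is meaningful.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class DeviceRecord:
    """Hypothetical per-device counters (not a BOINC structure)."""
    good: int = 0
    bad: int = 0
    disabled_until: float = 0.0  # timestamp; 0.0 means never disabled

    def error_rate(self) -> float:
        total = self.good + self.bad
        return self.bad / total if total else 0.0


class DeviceTracker:
    """Tracks computational errors per device instance, not per device type."""

    def __init__(self, max_error_rate: float = 0.5, min_results: int = 10,
                 backoff_seconds: float = 3600.0) -> None:
        # All three thresholds are assumed values chosen for the example.
        self.max_error_rate = max_error_rate
        self.min_results = min_results
        self.backoff_seconds = backoff_seconds
        self.devices: Dict[int, DeviceRecord] = {}

    def record(self, device_id: int, ok: bool, now: float) -> None:
        """Record one finished task; back the device off if it errs too often."""
        rec = self.devices.setdefault(device_id, DeviceRecord())
        if ok:
            rec.good += 1
        else:
            rec.bad += 1
        enough = rec.good + rec.bad >= self.min_results
        if enough and rec.error_rate() > self.max_error_rate:
            rec.disabled_until = now + self.backoff_seconds
            rec.good = rec.bad = 0  # start counting afresh after the backoff

    def usable(self, device_ids: Iterable[int], now: float) -> List[int]:
        """Return the devices that may currently be offered to the scheduler."""
        out = []
        for d in device_ids:
            rec = self.devices.get(d)
            if rec is None or rec.disabled_until <= now:
                out.append(d)
        return out


# Example: device 0 always succeeds, device 1 always fails.
tracker = DeviceTracker(min_results=4, backoff_seconds=600.0)
for _ in range(4):
    tracker.record(0, ok=True, now=0.0)
    tracker.record(1, ok=False, now=0.0)
print(tracker.usable([0, 1], now=1.0))
```

The scheduler itself is left alone in this sketch: it is simply handed a shorter device list while a device is backed off, which is the effect the quoted message compares to a user starting a "no-GPU" app. Results that finish without error but later fail validation would still need the server-side extension mentioned in the thread.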
