I, for one, like Lynn's approach best. I have no idea how much work would be involved, but I like being able to easily see how each device is performing (if that is in fact a result of assigning each device its own host ID).
--------
Jim Preston
BOINC Support Volunteer
[email protected]
SKYPE jhparizona-boinc

On Mar 28, 2010, at 1:51 PM, Lynn W. Taylor wrote:

> I don't see how "pollution" justifies this much work. Certainly the
> "pollution" from invalid work is less important than the pollution from
> a science application that selectively kills unfavorable work.
>
> That said, the simplest solution may be to treat each device as a
> separate host. Different host ID, different queue, everything.
>
> Cruncher with three CUDA cards, four host ids.
>
> On 3/28/2010 1:21 PM, Raistmer wrote:
>> Unfortunately, binding quota to device type (instead of device instance)
>> will not solve the current issues with multi-GPU hosts.
>> Such hosts (or hosts with a multi-core GPU) can do correct computations
>> on one GPU (GPU core) and incorrect ones (for example, constantly
>> throwing -9 overflow in the SETI project) on another.
>> IMHO there is no need to implement full-scale scheduling algorithms (I
>> suppose this is the thing you called modeling) on a per-device basis.
>> All that would be needed is just an additional field in the structure
>> that describes the device.
>> When work is assigned to a device, BOINC knows to which particular device
>> it assigns a particular task. Then it could check (client, not server)
>> the outcome of this particular result (was there a computational error
>> or not) and update the corresponding field in the structure for that
>> particular device.
>> Sure, it can't catch invalid results; invalid status will be known only
>> after validation, i.e. the server should be involved.
>> But such a simplified mechanism could check computational errors (in
>> particular, SETI's -9 overflow or the CUDA-specific -1, not implemented).
>> Unfortunately, there are complications: overflow can be thrown for a
>> completely valid result too, but here the rate of such errors could play
>> some role...
>> As a bigger extension, the BOINC client could attach an additional field
>> with the device ID when reporting a result to the server.
>> On the next request the server could tell the client the updated
>> good/bad ratio for each device ID. Devices with poor good/bad ratios
>> could be disabled for some period of time (something like a device-wide
>> backoff in computations). Server-side changes are required here, but
>> again, there is no need to do full-scale scheduling on a per-device
>> basis. Actually, scheduling should not be touched at all.
>> The BOINC client could just disable/enable the corresponding devices
>> according to the device good/bad ratio (this would just decrease the
>> number of devices available for scheduling; AFAIK BOINC currently has to
>> deal with the same situation. For example, the number of available
>> devices changes when the user starts a "no-GPU" app).
>>
>> ----- Original Message -----
>> From: "David Anderson" <[email protected]>
>> To: "Raistmer" <[email protected]>
>> Cc: <[email protected]>
>> Sent: Sunday, March 28, 2010 11:23 PM
>> Subject: Re: [boinc_dev] BOINC's Quota system needs change
>>
>>> The new system (see updated doc:
>>> http://boinc.berkeley.edu/trac/wiki/CreditNew)
>>> will have separate quotas and error rates per resource type
>>> (CPU, NVIDIA, ATI).
>>>
>>> Maintaining these separately for each GPU would require
>>> modeling multiple GPUs separately,
>>> rather than as N instances of the same thing as is currently done.
>>> This would be a sweeping change, and won't get done in the near term.
>>>
>>> -- David
>>>
>>> Raistmer wrote:
>>>> If a host's task quota is computed in the old way, a host that does
>>>> valid CPU computations but invalid GPU ones will pollute the database
>>>> and waste project resources indefinitely.
>>>> A GPU is usually much faster than a CPU, so many invalid tasks can be
>>>> returned per single valid one.
>>>> Moreover, even if CPU/GPU quota separation is introduced, there are
>>>> still multi-GPU hosts that can pollute the database at an even higher
>>>> rate, doing correct computations on one GPU and invalid ones on another.
>>>> The current quota system is applicable only to the single-host,
>>>> single-device approach and apparently should be changed.
>>>> Right now I have no good idea what the replacement could be, but this
>>>> question definitely deserves consideration.
>>>>
>>>> One possible solution could be to track the good/bad results ratio per
>>>> hardware device (not per host) and inhibit work fetch for the whole
>>>> host if one of its devices has too poor a good/bad ratio. Or issue some
>>>> instruction to the BOINC client to block the affected device from
>>>> receiving work (this could be a more graceful approach).
>>>> More ideas?
>>>>
>>>> _______________________________________________
>>>> boinc_dev mailing list
>>>> [email protected]
>>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>>> To unsubscribe, visit the above URL and
>>>> (near bottom of page) enter your email address.
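
For illustration only, the client-side bookkeeping Raistmer describes in the quoted thread (a per-device counter updated from each result's outcome, with poorly performing devices taken out of the pool) might look roughly like the sketch below. Nothing here is BOINC code: `DeviceStats`, `usable_devices`, and the `max_error_rate` and `min_results` thresholds are hypothetical names and values chosen for the example. It tracks an error rate rather than a good/bad ratio, which is the same information but avoids dividing by zero.

```python
from dataclasses import dataclass


@dataclass
class DeviceStats:
    """Per-device result counters (hypothetical; not a BOINC structure)."""
    device_id: int
    good: int = 0
    bad: int = 0

    def record(self, computational_error: bool) -> None:
        # Called once per finished task with the client-visible outcome.
        if computational_error:
            self.bad += 1
        else:
            self.good += 1

    def error_rate(self) -> float:
        # Fraction of results that ended in a computational error.
        total = self.good + self.bad
        return self.bad / total if total else 0.0


def usable_devices(devices, max_error_rate=0.5, min_results=10):
    """Return IDs of devices that should still be offered work.

    A device is skipped only once it has at least `min_results` results
    and its error rate exceeds `max_error_rate` (both values are assumed).
    """
    usable = []
    for d in devices:
        enough_data = (d.good + d.bad) >= min_results
        if enough_data and d.error_rate() > max_error_rate:
            continue  # treat as "backed off"
        usable.append(d.device_id)
    return usable


# Example: GPU 0 always succeeds, GPU 1 always errors out.
gpu0 = DeviceStats(device_id=0)
gpu1 = DeviceStats(device_id=1)
for _ in range(20):
    gpu0.record(computational_error=False)
    gpu1.record(computational_error=True)

print(usable_devices([gpu0, gpu1]))  # [0]
```

As the thread notes, this only catches errors the client can see (such as a computational error exit code). Results that turn out invalid at validation would still need the server to report back per device.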
