I don't see how "pollution" justifies this much work.  Certainly the 
"pollution" from invalid work is less important than the pollution from 
a science application that selectively kills unfavorable work.

That said, the simplest solution may be to treat each device as a 
separate host.  Different host ID, different queue, everything.

For example, a cruncher with three CUDA cards would have four host IDs.
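A minimal sketch of the one-host-per-device idea. It assumes each device simply gets its own derived ID. The `Device` type, `split_into_virtual_hosts` and the string ID format are hypothetical, and real BOINC host IDs are database integers. Once split this way, each device would carry its own queue, quota and error history, like a separate machine:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    kind: str   # e.g. "CPU", "CUDA"
    index: int  # instance number within its kind


def split_into_virtual_hosts(host_id: int, devices: list[Device]) -> dict[Device, str]:
    """Give every device its own (hypothetical) virtual host ID.

    Each virtual host would then have its own queue, quota and
    error history, exactly as if it were a separate machine.
    """
    return {dev: f"{host_id}-{dev.kind.lower()}{dev.index}" for dev in devices}


# One CPU plus three CUDA cards on a single physical host.
cruncher = [Device("CPU", 0)] + [Device("CUDA", i) for i in range(3)]
ids = split_into_virtual_hosts(12345, cruncher)
print(len(ids))                   # 4
print(ids[Device("CUDA", 2)])     # 12345-cuda2
```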

On 3/28/2010 1:21 PM, Raistmer wrote:
> Unfortunately, binding quota to device type (instead of device instance)
> will not solve current issues with multi-GPU hosts.
> Such hosts (or hosts with a multi-core GPU) can compute correctly on one
> GPU (GPU core) and incorrectly on another (for example, constantly throwing
> the -9 overflow in the SETI project).
> IMHO there is no need to implement full-scale scheduling algorithms (I
> suppose this is what you called modeling) on a per-device basis.
> All that would be needed is an additional field in the structure that
> describes a device.
> When work is assigned, BOINC knows which particular device each task goes
> to. The client (not the server) could then check the outcome of that
> particular result (computation error or not) and update the corresponding
> field in that device's structure.
> Sure, it can't catch invalid results: invalid status is known only after
> validation, i.e. the server has to be involved.
> But such a simplified mechanism could catch computation errors (in
> particular, SETI's -9 overflow or the CUDA-specific -1, not implemented).
> Unfortunately, there are complications: an overflow can be thrown for a
> completely valid result too, but here the rate of such errors could play
> some role...
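The client-side bookkeeping described here could look roughly like the following. `DeviceErrorStats` stands in for the hypothetical extra field in the per-device structure. The 50% threshold and the 10-task minimum are arbitrary assumptions. They are chosen so that an occasional overflow on a valid task does not, by itself, condemn a device:

```python
from dataclasses import dataclass


@dataclass
class DeviceErrorStats:
    """Hypothetical extra field in the client's per-device structure."""
    completed: int = 0
    errors: int = 0

    def record(self, had_error: bool) -> None:
        # Called when a task that ran on this particular device finishes.
        self.completed += 1
        if had_error:
            self.errors += 1

    def error_rate(self) -> float:
        return self.errors / self.completed if self.completed else 0.0

    def looks_broken(self, max_rate: float = 0.5, min_samples: int = 10) -> bool:
        # A single overflow can come from a perfectly valid task, so judge
        # the device only by its error rate, and only after enough samples.
        return self.completed >= min_samples and self.error_rate() > max_rate


stats = {0: DeviceErrorStats(), 1: DeviceErrorStats()}  # keyed by GPU index
for _ in range(20):
    stats[0].record(had_error=False)  # healthy GPU
    stats[1].record(had_error=True)   # GPU that keeps hitting the -9 overflow
print([gpu for gpu, s in stats.items() if s.looks_broken()])  # [1]
```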
> As a bigger extension, the BOINC client could attach an additional field
> with the device ID when reporting a result to the server.
> On the next request the server could tell the client the updated good/bad
> ratio for each device ID. Devices with poor good/bad ratios could be
> disabled for some period of time (something like a device-wide backoff in
> computations). This requires server-side changes, but again, there is no
> need to do full-scale scheduling on a per-device basis. Actually,
> scheduling should not be touched at all. The BOINC client could just
> disable/enable the corresponding devices according to each device's
> good/bad ratio. This would just decrease the number of devices available
> for scheduling, and AFAIK BOINC should already deal with that situation:
> for example, the number of available devices changes when the user starts
> a "no-GPU" app.
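A rough sketch of the device-wide backoff. It assumes the server returns, for each device ID, the fraction of that device's results that were good, which is one possible reading of "good/bad ratio". `DeviceBackoff`, the 0.8 cutoff and the 24-hour backoff are all hypothetical. The scheduler itself is untouched and is simply handed a shorter device list:

```python
class DeviceBackoff:
    """Hypothetical client-side device backoff driven by server-reported ratios."""

    def __init__(self, min_good_ratio: float = 0.8, backoff_secs: float = 24 * 3600):
        self.min_good_ratio = min_good_ratio
        self.backoff_secs = backoff_secs
        self.disabled_until: dict[int, float] = {}  # device ID -> time it may run again

    def apply_server_ratios(self, ratios: dict[int, float], now: float) -> None:
        # ratios: device ID -> good / (good + bad), as sent in a scheduler reply
        for dev_id, ratio in ratios.items():
            if ratio < self.min_good_ratio:
                self.disabled_until[dev_id] = now + self.backoff_secs

    def usable_devices(self, all_devices: list[int], now: float) -> list[int]:
        # Only this filtered list reaches the scheduler; its logic is unchanged.
        return [d for d in all_devices if self.disabled_until.get(d, 0.0) <= now]


b = DeviceBackoff()
b.apply_server_ratios({0: 0.99, 1: 0.10}, now=0.0)
print(b.usable_devices([0, 1], now=3600.0))          # [0]
print(b.usable_devices([0, 1], now=2 * 24 * 3600))   # [0, 1]
```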
>
> ----- Original Message -----
> From: "David Anderson"<[email protected]>
> To: "Raistmer"<[email protected]>
> Cc:<[email protected]>
> Sent: Sunday, March 28, 2010 11:23 PM
> Subject: Re: [boinc_dev] BOINC's Quota system needs change
>
>
>> The new system (see updated doc:
>> http://boinc.berkeley.edu/trac/wiki/CreditNew)
>> will have separate quotas and error rates per resource type
>> (CPU, NVIDIA, ATI).
>>
>> Maintaining these separately for each GPU would require
>> modeling multiple GPUs separately,
>> rather than as N instances of the same thing as is currently done.
>> This would be a sweeping change, and won't get done in the near term.
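To make the per-type vs. per-instance difference concrete, here is a toy comparison that uses made-up outcomes, not project data. The setup has one healthy NVIDIA card, one NVIDIA card that returns only bad results, and a CPU. Keyed by resource type, the two cards average out to a 50% error rate. Keyed by device instance, the bad card stands out:

```python
from collections import defaultdict

# Illustrative outcome log: (resource type, device instance, result was good)
outcomes = (
    [("NVIDIA", 0, True)] * 50     # healthy card
    + [("NVIDIA", 1, False)] * 50  # card that returns only bad results
    + [("CPU", 0, True)] * 5
)


def error_rates(log, key):
    """Error rate per group, where key(entry) picks the grouping."""
    totals, errors = defaultdict(int), defaultdict(int)
    for entry in log:
        k = key(entry)
        totals[k] += 1
        if not entry[2]:
            errors[k] += 1
    return {k: errors[k] / totals[k] for k in totals}


per_type = error_rates(outcomes, key=lambda e: e[0])
per_device = error_rates(outcomes, key=lambda e: (e[0], e[1]))
print(per_type)    # {'NVIDIA': 0.5, 'CPU': 0.0}
print(per_device)  # {('NVIDIA', 0): 0.0, ('NVIDIA', 1): 1.0, ('CPU', 0): 0.0}
```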
>>
>> -- David
>>
>> Raistmer wrote:
>>> If a host's task quota is computed the old way, a host that does valid
>>> CPU computations but invalid GPU ones will pollute the database and waste
>>> project resources indefinitely.
>>> A GPU is usually much faster than a CPU, so many invalid tasks can be
>>> returned per single valid one.
>>> Moreover, even if CPU/GPU quota separation is introduced, there are still
>>> multi-GPU hosts that can pollute the database at an even higher rate,
>>> doing correct computations on one GPU and invalid ones on the others.
>>> The current quota system is applicable only to a single-host/single-device
>>> approach and apparently should be changed.
>>> Right now I have no good idea what the replacement could be, but this
>>> question definitely deserves consideration.
>>>
>>> One possible solution could be to track the good/bad result ratio per
>>> hardware device (not per host) and inhibit work fetch for the whole host
>>> if one of its devices has too poor a good/bad ratio. Or issue some
>>> instruction to the BOINC client to block the affected device from
>>> receiving work (this could be a more graceful approach).
>>> More ideas?
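The two options in this paragraph can be compared in a small sketch. It assumes the same per-device fraction-of-good-results input and a hypothetical 0.8 threshold. `work_fetch_plan` is illustrative only and returns, for each device, whether it may still fetch work:

```python
def work_fetch_plan(good_fraction: dict[str, float],
                    threshold: float = 0.8,
                    block_whole_host: bool = False) -> dict[str, bool]:
    """Map each device to whether it may still fetch work (hypothetical)."""
    bad = {dev for dev, frac in good_fraction.items() if frac < threshold}
    if block_whole_host:
        # Option 1: any bad device inhibits work fetch for the entire host.
        return {dev: not bad for dev in good_fraction}
    # Option 2: only the affected devices are blocked from receiving work.
    return {dev: dev not in bad for dev in good_fraction}


good = {"CPU": 0.98, "GPU0": 0.97, "GPU1": 0.05}
print(work_fetch_plan(good, block_whole_host=True))
# {'CPU': False, 'GPU0': False, 'GPU1': False}
print(work_fetch_plan(good))
# {'CPU': True, 'GPU0': True, 'GPU1': False}
```

The per-device variant keeps the healthy CPU and GPU busy, which is presumably why it reads as the more graceful option.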
>>>
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected]
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
>>
>