Re: [boinc_dev] host punishment mechanism revisited

Raistmer Thu, 27 May 2010 02:46:52 -0700

"Punishment" for invalid results will not help at all, IMO, with broken 
CUDA GPUs.
Tasks are reported on a scale of seconds and validated on a scale of weeks, 
while such a GPU stays broken for about a day (a host reboot fixes it).
That is, in most cases this delayed validation accounting will "punish" a 
host for a problem that was fixed long ago. That is not a very effective way 
to deal with this problem.


----- Original Message ----- 
From: <[email protected]>
To: "Kevin Reed" <[email protected]>
Cc: "BOINC Developers Mailing List" <[email protected]>; 
<[email protected]>
Sent: Thursday, May 27, 2010 9:18 AM
Subject: Re: [boinc_dev] host punishment mechanism revisited


> +1 and -1 are too slow to be of much use unless the range is very short
> (i.e. 1 to 10 with 10 being completely trusted, and 1 being completely
> untrusted).  If we split the allocation between report and validation, we
> could have +1 for success, +1 for validation (for a total of +2 for a
> successful valid result), and -1 for error and -2 for invalid (for -1 for
> each of the cases of unsuccessful and successful but invalid).  The invalid
> punishment has to be -(success reward) + (whatever the punishment for an
> invalid result should be - which is a negative number).  To make it
> completely balanced, it would be 0.5 for reward for each of success and
> valid, and -1 for unsuccessful and -1.5 for successful but invalid.
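
The "completely balanced" variant above can be sketched as follows. This is only an illustration of the arithmetic, not scheduler code: the constant names and the net_change() helper are hypothetical, and the pending-validation case (None) is an assumption about how a not-yet-validated result would be counted.

```python
from typing import Optional

# Hypothetical per-event adjustments for the "completely balanced" variant.
REPORT_SUCCESS = 0.5   # task reported without error
VALIDATED = 0.5        # reported result later passed validation
REPORT_ERROR = -1.0    # task errored out
INVALID = -1.5         # reported successfully, but later failed validation


def net_change(reported_ok: bool, validated_ok: Optional[bool]) -> float:
    """Net trust change for one task.

    validated_ok is None while validation is still pending (assumption).
    """
    if not reported_ok:
        return REPORT_ERROR
    change = REPORT_SUCCESS
    if validated_ok is True:
        change += VALIDATED
    elif validated_ok is False:
        change += INVALID
    return change
```

With these values a valid result nets +1, while an error and a successful-but-invalid result each net -1, which is the balance the paragraph describes.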
>
> We also need to catch validation errors, as has been proven by the ongoing
> CUDA problem where CUDA cards start returning junk after a while and it
> takes a computer reboot to correct the problem.
>
> Which is the more stringent requirement depends on run times.  If a task
> takes minutes to error out, the maximum # of tasks in progress is the more
> stringent.  If it takes over a day / task, then the max # downloads is the
> more stringent requirement.  Since the project should protect itself
> against slow problems as well as fast problems, I am not completely certain
> that is the right division.
>
> Perhaps if a host is in state #2, we allow a per individual resource
> allocation / day.  The allocation in state #2 should reduce as we become
> more suspicious of the resource type.  If the host is in state #3, we allow
> an allocation per resource type / day (preferably an allocation of one for
> each of CPU and each GPU type).  If we know it is a problem, we are waiting
> for the host to be fixed.
>
> I would suggest that the punishment and quota be per resource type instead
> of per host.  It is quite possible that the CPU is crunching merrily along
> but the GPU is failing every task.  Of course, if there are two different
> brands of GPU, each one could have different failure modes...
>
> jm7
>
>
>
> From:        Kevin Reed <[email protected]>
> Sent by:     <[email protected]>
> To:          David Anderson <[email protected]>
> Cc:          BOINC Developers Mailing List <[email protected]>
> Date:        05/26/2010 05:04 PM
> Subject:     Re: [boinc_dev] host punishment mechanism revisited
>
> David,
>
> I would suggest that there are three states for a host:
>
> 1) It has proven to be reliable
> 2) We are suspicious of it
> 3) We know it has a problem
>
> If it is in state #1, then it should be allowed to compute without limit
> If it is in state #2, then it should have a limit on the max number of
> results it can have in progress for the app version
> If it is in state #3, then it should have a limit on the max number of
> results for the day that it can process
>
> There would be a parameter that would be something like
> <host_app_version_limit>X</host_app_version_limit>.  Each host has a
> host_app_version value 'Y'.
>
> Y is initialized for a new host to 50% of X
> When a success result is returned Y is incremented by one until it reaches
> X
> When an error result is returned Y is decremented by one until it reaches 1
>
> When Y == X, then the host has no restrictions based on this value for the
> # of results in progress it can have for app version
> When Y == 1, then the host can only have one result per day for the app
> version
> When Y < X && Y > 1 then the host can have Y results in progress per
> processing unit for the app_version
>
> This mechanism will allow a computer to run an unlimited amount of work per
> day as long as Y is greater than 1.  It just can't build up a large cache
> unless Y==X.
>
> This mechanism also should have a limited impact on the database as most
> computers will be at either Y == 1 or Y == X so there will be few queries
> to see how many results in progress there are.
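
A minimal sketch of the X/Y bookkeeping described above, for illustration only. The class, method, and label names are hypothetical and do not come from the BOINC source. The sketch assumes that Y is floored at 1 on errors (consistent with the "Y == 1" rule) and that "50% of X" rounds down.

```python
from dataclasses import dataclass


@dataclass
class HostAppVersion:
    limit: int   # X: the project-wide host_app_version_limit
    value: int   # Y: this host's current value for the app version

    @classmethod
    def new_host(cls, limit: int) -> "HostAppVersion":
        # A new host starts at 50% of X (rounded down, never below 1).
        return cls(limit=limit, value=max(1, limit // 2))

    def on_success(self) -> None:
        # Incremented by one until it reaches X.
        self.value = min(self.limit, self.value + 1)

    def on_error(self) -> None:
        # Decremented by one; the floor of 1 is an assumption.
        self.value = max(1, self.value - 1)

    def restriction(self):
        """Return (kind, amount) for the restriction currently in force."""
        if self.value >= self.limit:
            return ("unlimited", None)           # Y == X: no restriction
        if self.value <= 1:
            return ("per_day", 1)                # Y == 1: one result per day
        # 1 < Y < X: Y results in progress per processing unit
        return ("in_progress_per_unit", self.value)
```

For example, a host created with HostAppVersion.new_host(10) starts with Y = 5 and may hold 5 results in progress per processing unit; five consecutive successes lift the restriction entirely.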
>
> Kevin Reed
>
>
>
>
> From:        David Anderson <[email protected]>
> To:          BOINC Developers Mailing List <[email protected]>
> Date:        05/24/2010 02:03 PM
> Subject:           host punishment mechanism revisited
>
>
>
> The BOINC scheduler has a mechanism called "host punishment"
> designed to deal with hosts that request an infinite sequence of jobs,
> and either error out on them or never return them.
>
> It works like this: there's a project parameter called "daily result quota",
> say 100.  Every host starts off with this quota.
> If it returns an error or times out, the quota is decremented down to,
> but not below, 1.  If it returns a valid result, the quota is doubled.
> The idea is that faulty hosts are given 1 job per day
> to see if they've been fixed.
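
The quota update just described can be sketched roughly as below. The function names are hypothetical, and capping the doubled quota at the project parameter is an assumption; the message only says that a host starts at that value.

```python
DAILY_RESULT_QUOTA = 100  # project parameter; 100 is the example value above


def on_error_or_timeout(quota: int) -> int:
    # Decrement, but never below 1.
    return max(1, quota - 1)


def on_valid_result(quota: int, max_quota: int = DAILY_RESULT_QUOTA) -> int:
    # Double the quota; the cap at max_quota is an assumption.
    return min(max_quota, quota * 2)


def valid_results_to_recover(quota: int, max_quota: int = DAILY_RESULT_QUOTA) -> int:
    """Count the valid results needed to climb from quota back to max_quota."""
    count = 0
    while quota < max_quota:
        quota = on_valid_result(quota, max_quota)
        count += 1
    return count
```

Under these assumptions a host that has been driven down to a quota of 1 gets a single job per day until one of its results is accepted as valid, after which each further valid result doubles its quota.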
>
> Recently this mechanism was changed from per-project to per-app-version,
> the idea being that a host might be erroring out on a GPU version
> but not the CPU version.
>
> However, the basic mechanism is somewhat flawed:
>
> 1) What if a fast host can do more than 100 jobs a day?
> We could increase the default quota, but that would let bad hosts
> trash that many more jobs.
>
> 2) It takes too long for a fixed host to ramp up its quota.
>
> The bottom line: as long as a host is sending correct results,
> it shouldn't have a daily quota at all.
>
> ---------
>
> If anyone has ideas for how to change the host punishment mechanism,
> please let me know.
> I'll think about it and post a proposal at some point.
>
> -- David
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.


