Steps of +1 and -1 are too slow to be of much use unless the range is very
short (e.g. 1 to 10, with 10 being completely trusted and 1 being completely
untrusted).  If we split the adjustment between report and validation, we
could have +1 for success and +1 for validation (for a total of +2 for a
successful, valid result), with -1 for an error and -2 for an invalid result
(a net -1 for each of the unsuccessful and the successful-but-invalid cases).
The invalid punishment has to be -(success reward) + (whatever the net
punishment for an invalid result should be, which is a negative number).  To
make it completely balanced, the reward would be 0.5 for each of success and
valid, with -1 for unsuccessful and -1.5 for successful but invalid.
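
To make the arithmetic concrete, here is a minimal sketch of the balanced
split, assuming the adjustment is applied in two steps (once when the result
is reported, once when it is validated).  The constant and function names are
made up for illustration; this is not existing BOINC code.

```python
# Hypothetical names; a sketch of the balanced scheme, not BOINC code.
REPORT_SUCCESS = 0.5      # result reported without error
REPORT_ERROR = -1.0       # result errored out (never reaches validation)
VALIDATE_OK = 0.5         # reported result later validated
VALIDATE_INVALID = -1.5   # reported result later found invalid


def net_change(success, valid=None):
    """Total trust change for one task across report and validation."""
    if not success:
        return REPORT_ERROR
    delta = REPORT_SUCCESS
    if valid is not None:  # None means validation has not happened yet
        delta += VALIDATE_OK if valid else VALIDATE_INVALID
    return delta


print(net_change(True, True))    # successful and valid
print(net_change(False))         # unsuccessful
print(net_change(True, False))   # successful but invalid
```

The three cases come out to +1, -1 and -1, so one good result exactly
cancels one bad result of either kind.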

We also need to catch validation errors, as has been proven by the ongoing
CUDA problem where CUDA cards start returning junk after a while and it
takes a computer reboot to correct the problem.

Which is the more stringent requirement depends on run times.  If a task
takes minutes to error out, the maximum # of tasks in progress is the more
stringent.  If it takes over a day / task, then the max # downloads is the
more stringent requirement.  Since the project should protect itself
against slow problems as well as fast problems, I am not completely certain
that is the right division.

Perhaps if a host is in state #2, we allow an allocation per individual
resource per day.  The allocation in state #2 should shrink as we become
more suspicious of the resource type.  If the host is in state #3, we allow
an allocation per resource type per day (preferably an allocation of one for
the CPU and one for each GPU type).  If we know there is a problem, we are
just waiting for the host to be fixed.

I would suggest that the punishment and quota be per resource type instead
of per host.  It is quite possible that the CPU is crunching merrily along
but the GPU is failing every task.  Of course, if there are two different
brands of GPU, each one could have different failure modes...
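
A rough sketch of what per-resource-type tracking could look like, assuming
the three host states from Kevin's message below are kept per resource type
rather than per host.  The names, the resource-type keys and the quota value
are illustrative assumptions, and how the state #2 quota shrinks with
suspicion is left to the caller.

```python
from enum import Enum


class State(Enum):
    RELIABLE = 1     # state #1: proven reliable
    SUSPICIOUS = 2   # state #2: we are suspicious of it
    KNOWN_BAD = 3    # state #3: we know it has a problem


def daily_allocation(state, instances, per_instance_quota):
    """Tasks per day for one resource type on one host (None = no daily cap)."""
    if state is State.RELIABLE:
        return None
    if state is State.SUSPICIOUS:
        # Allocation per individual resource; the caller lowers
        # per_instance_quota as suspicion of this resource type grows.
        return instances * per_instance_quota
    return 1  # one probe task per day for the whole resource type


# State is tracked per resource type, so a healthy CPU is not punished
# for a failing GPU.  Values are (state, number of instances).
host = {
    "cpu": (State.RELIABLE, 4),
    "gpu_brand_a": (State.KNOWN_BAD, 1),
    "gpu_brand_b": (State.SUSPICIOUS, 2),
}
for rtype, (state, n) in host.items():
    print(rtype, daily_allocation(state, n, per_instance_quota=3))
```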

jm7


                                                                           
From:        Kevin Reed <[email protected]>
Sent by:     <boinc_dev-bounce[email protected]>
To:          David Anderson <[email protected]>
cc:          BOINC Developers Mailing List <[email protected]>
Date:        05/26/2010 05:04 PM
Subject:     Re: [boinc_dev] host punishment mechanism revisited

David,

I would suggest that there are three states for a host:

1) It has proven to be reliable
2) We are suspicious of it
3) We know it has a problem

If it is in state #1, then it should be allowed to compute without limit.
If it is in state #2, then it should have a limit on the max number of
results it can have in progress for the app version.
If it is in state #3, then it should have a limit on the max number of
results for the day that it can process.

There would be a parameter that would be something like
<host_app_version_limit>X</host_app_version_limit>.  Each host has a
host_app_version value 'Y'.

Y is initialized for a new host to 50% of X.
When a success result is returned, Y is incremented by one until it reaches
X.
When an error result is returned, Y is decremented by one until it reaches 1.

When Y == X, then the host has no restrictions based on this value for the
# of results in progress it can have for app version
When Y == 1, then the host can only have one result per day for the app
version
When Y < X && Y > 1 then the host can have Y results in progress per
processing unit for the app_version

This mechanism will allow a computer to run an unlimited amount of work per
day as long as Y is greater than 1.  It just can't build up a large cache
unless Y==X.
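
The rules above can be sketched as follows.  This is a minimal illustration,
not scheduler code: the class and method names are invented, X is assumed to
be at least 2, and the error rule is read as flooring Y at 1, which is what
the Y == 1 case implies.

```python
class HostAppVersion:
    """Per-host, per-app-version counter Y bounded by the project limit X."""

    def __init__(self, limit_x):
        self.x = limit_x
        self.y = max(1, limit_x // 2)  # new host starts at 50% of X

    def on_success(self):
        self.y = min(self.x, self.y + 1)  # increment, capped at X

    def on_error(self):
        self.y = max(1, self.y - 1)  # decrement, floored at 1

    def limits(self, processing_units):
        """Return (max in progress, max per day); None means unlimited."""
        if self.y >= self.x:
            return (None, None)  # trusted: no restriction from this value
        if self.y == 1:
            return (1, 1)  # known bad: one result per day
        return (self.y * processing_units, None)  # suspicious: cap the cache


h = HostAppVersion(limit_x=10)
print(h.y, h.limits(processing_units=2))
for _ in range(5):
    h.on_success()
print(h.y, h.limits(processing_units=2))
```

Only the middle case needs a count of results in progress, which is why the
database impact should stay small if most hosts sit at Y == 1 or Y == X.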

This mechanism also should have a limited impact on the database as most
computers will be at either Y == 1 or Y == X so there will be few queries
to see how many results in progress there are.

Kevin Reed




From:        David Anderson <[email protected]>
To:          BOINC Developers Mailing List <[email protected]>
Date:        05/24/2010 02:03 PM
Subject:     host punishment mechanism revisited



The BOINC scheduler has a mechanism called "host punishment"
designed to deal with hosts that request an infinite sequence of jobs,
and either error out on them or never return them.

It works like this: there's a project parameter called "daily result quota",
say 100.  Every host starts off with this quota.
If it returns an error or times out, the quota is decremented down to,
but not below, 1.  If it returns a valid result, the quota is doubled.
The idea is that faulty hosts are given 1 job per day
to see if they've been fixed.
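
As a rough sketch of the update rule just described (not the actual
scheduler code): the function name is invented, the decrement is assumed to
be by one, and capping the doubled quota at the project value is an
assumption the description does not state.

```python
def update_quota(quota, outcome, project_quota=100):
    """Return the host's new daily quota after one result comes back."""
    if outcome == "valid":
        # Doubling; the cap at project_quota is an assumption.
        return min(project_quota, quota * 2)
    # Error or timeout: decrement (assumed by one), but never below 1.
    return max(1, quota - 1)


# How many consecutive failures take a fresh host down to 1 job per day?
q, failures = 100, 0
while q > 1:
    q = update_quota(q, "error")
    failures += 1
print(failures)  # 99

# How many valid results does a fixed host need to get back to 100?
q, valids = 1, 0
while q < 100:
    q = update_quota(q, "valid")
    valids += 1
print(valids)  # 7
```

Under these assumptions a fixed host needs seven valid results to recover
the full quota, and while its quota is still low it only receives a few jobs
per day to prove itself with.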

Recently this mechanism was changed from per-project to per-app-version,
the idea being that a host might be erroring out on a GPU version
but not the CPU version.

However, the basic mechanism is somewhat flawed:

1) What if a fast host can do more than 100 jobs a day?
We could increase the default quota, but that would let bad hosts
trash that many more jobs.

2) It takes too long for a fixed host to ramp up its quota.

The bottom line: as long as a host is sending correct results,
it shouldn't have a daily quota at all.

---------

If anyone has ideas for how to change the host punishment mechanism,
please let me know.
I'll think about it and post a proposal at some point.

-- David
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
