Changing the time between checks WILL NOT FIX THE PROBLEM.

jm7


                                                                           
             "Paul D. Buck"                                                
             <p.d.b...@comcast                                             
             .net>                                                      To 
                                       [email protected]              
             04/27/2009 05:20                                           cc 
             PM                        "Josef W. Segur"                    
                                       <[email protected]>, BOINC dev   
                                       <[email protected]>, David 
                                       Anderson <[email protected]>,  
                                       Rom Walton <[email protected]>,      
                                       Richard Haselgrove                  
                                       <[email protected]>       
                                                                   Subject 
                                       Re: [boinc_dev] 6.6.20 and work     
                                       scheduling                          
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           




Now that we know that one of the drivers for the current system is a
project that collapsed under unrealistic expectations, let us now rejoin
the real world and consider the scheduling of currently available tasks on
the resources.

Project deadlines at this time run from 24 hours to 18 months, with the
normal deadline most likely being 7-14 days.  Projects with short deadlines
also tend to have short run time tasks (IBERCIVIS, VTU, MW, etc.).  With
"reasonable" cache size settings I have never seen true deadline peril on
my running systems.

However, because of some unreasonable design decisions, made in good faith,
we have a confluence of factors that give rise to instability in scheduling
and running work.

The problems with Resource Scheduling include, but may not be limited to:

1) We do it too often (event driven)
2) All currently running tasks are eligible for preemption
3) TSI (task switch interval) is not respected as a limiting factor
4) TSI is used in calculating deadline peril
5) Work mix is not kept "interesting"
6) Resource Share is used in calculating run time allocations
7) Work "batches" (tasks with roughly similar deadlines) are not "bank
teller queued"
8) History of work scheduling is not preserved and all deadlines are
calculated fresh each invocation.
9) True deadline peril is rare, but "false positives" are common
10) Some of the sources of work peril may be caused by a defective work
fetch allocation
11) Other factors that are either obscured by the above, that I have
forgotten, or perhaps nothing else ...

These factors together cause instability in the scheduling and running of
tasks in the BOINC Client.  Changing one or two of these rules or factors
will not bring the needed stability.

The first change I would suggest is to change the gateway so that there is
a floor on the interval between CPU reschedules (in other words, a cap on
the number of times per HOUR that we reschedule the CPU).  To be honest I
would suggest we actually use TSI as the limit that it SHOULD be.  Remove
all of the event based drivers of the CPU scheduler aside from task
completion.  This change addresses #1, #2, and #3.
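Something along these lines is all I am asking for; the names here
(SchedGate, should_schedule, min_reschedule_interval) are invented for the
sake of illustration and are not actual client code.  The point is simply
that task completion always forces a pass, and everything else is rate
limited to one pass per interval:

    // Hypothetical rate limiter for CPU scheduling passes; illustrative only.
    // min_reschedule_interval would default to the task switch interval (TSI).
    struct SchedGate {
        double last_schedule_time = 0;     // wall-clock time of the last pass, seconds
        double min_reschedule_interval;    // the floor, in seconds (e.g. TSI)

        explicit SchedGate(double interval) : min_reschedule_interval(interval) {}

        // Returns true if a scheduling pass should run now.
        // task_completed is the one event that always forces a pass.
        bool should_schedule(double now, bool task_completed) {
            if (task_completed ||
                now - last_schedule_time >= min_reschedule_interval) {
                last_schedule_time = now;
                return true;
            }
            return false;   // all other event-driven requests are ignored
        }
    };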


Collect the average time between task completions.  Use this in place of
TSI in the calculations of deadline peril.  This is a more realistic value
for the actual deadline potential than TSI anyway.  A multiplier (I
suggested a value of 1.5 in the past) can be used to give some margin for
error.  This addresses #4.
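A rough sketch of what I mean, again with made-up names (CompletionTracker,
peril_interval) and assuming a simple running mean, though an exponentially
weighted average would work just as well:

    // Hypothetical tracker for the average gap between task completions on a host.
    struct CompletionTracker {
        double last_completion = 0;
        double avg_gap = 0;     // running mean of times between completions, seconds
        int    gaps = 0;        // number of gaps observed
        bool   have_first = false;

        void on_task_completion(double now) {
            if (have_first) {
                gaps++;
                avg_gap += ((now - last_completion) - avg_gap) / gaps;  // incremental mean
            }
            last_completion = now;
            have_first = true;
        }

        // Value to use in place of TSI when estimating deadline peril.
        // The 1.5 multiplier gives some margin for error.
        double peril_interval(double fallback_tsi) const {
            return gaps > 0 ? 1.5 * avg_gap : fallback_tsi;
        }
    };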


Pull the work from the queue in deadline order on task completion or TSI.
I guess that this is EDF mode, though my experience to this point leads me
to believe that the end result is that batches are still going to swamp the
system.  My suggestion was, and still is, to consider a resource throttle
to limit the number of instances of tasks running per project (see the
sketch below).  A limiting factor here is the number of projects that
currently have work.  If I have work from only 4 projects and 8 resources
then at best I am still going to have two resources applied to each project.
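The selection loop I have in mind would look roughly like this; the types
and names (Task, pick_tasks, max_per_project) are invented for illustration
and this is a sketch of the idea, not a proposed patch:

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    struct Task {
        std::string project;
        double deadline;      // seconds from now
    };

    // Pick up to n_resources tasks in deadline order (EDF), but run at most
    // max_per_project instances from any one project while other projects
    // still have runnable work.
    std::vector<Task> pick_tasks(std::vector<Task> queue,
                                 int n_resources,
                                 int max_per_project) {
        std::sort(queue.begin(), queue.end(),
                  [](const Task& a, const Task& b) { return a.deadline < b.deadline; });

        std::vector<Task> chosen;
        std::vector<Task> deferred;
        std::map<std::string, int> per_project;

        for (const Task& t : queue) {
            if ((int)chosen.size() >= n_resources) break;
            if (per_project[t.project] >= max_per_project) {
                deferred.push_back(t);      // let another project have the slot
                continue;
            }
            per_project[t.project]++;
            chosen.push_back(t);
        }
        // If the throttle would leave resources idle (fewer projects than
        // resources), fill the remaining slots in deadline order anyway.
        for (const Task& t : deferred) {
            if ((int)chosen.size() >= n_resources) break;
            chosen.push_back(t);
        }
        return chosen;
    }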

If we let Resource Share control Work Fetch only, and do not let it be a
controlling factor in run-time scheduling on systems with 4 or more
resources, then we can still maintain our overall balance that way, and
this frees us from a constraint that may not be necessary and may be
leading to some unwanted instability.

If we have work from more projects than resources then, as long as all of
the tasks could still be completed in time with each project's tasks
processed serially, we should limit running work to one task per project at
a time.  NCI projects would not be considered in this calculation.  To put
it another way, in calculating deadline peril I would line up each
project's tasks serially, and if I could still get them all done in time I
would not run more than one task from any one project at a time, as long as
I had tasks from other projects.  I might be running tasks from IBERCIVIS
for some time on one resource (for example), but so what; I will likely be
happily running tasks from the other projects on the other CPUs/resources.

This takes care of #5, #6, and #7
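The feasibility test behind "lining the tasks up serially" could look
something like the following; again, the names (QueuedTask, est_runtime,
serial_per_project_is_feasible) are invented and this is only an
illustration of the check, not actual client code:

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    struct QueuedTask {
        std::string project;
        double est_runtime;   // estimated remaining run time, seconds
        double deadline;      // seconds from now
    };

    // Returns true if every project could meet its deadlines while running its
    // own tasks one at a time, i.e. one running task per project is enough.
    bool serial_per_project_is_feasible(const std::vector<QueuedTask>& tasks) {
        std::map<std::string, std::vector<QueuedTask>> by_project;
        for (const auto& t : tasks) by_project[t.project].push_back(t);

        for (auto& kv : by_project) {
            std::vector<QueuedTask>& list = kv.second;
            std::sort(list.begin(), list.end(),
                      [](const QueuedTask& a, const QueuedTask& b) {
                          return a.deadline < b.deadline;
                      });
            double elapsed = 0;
            for (const auto& t : list) {       // run this project's tasks back to back
                elapsed += t.est_runtime;
                if (elapsed > t.deadline) return false;   // would miss a deadline
            }
        }
        return true;
    }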


Though I suggest TSI as the floor below which we shall not go in scheduling
resources, with task completion as the only event that causes a scheduling
pass, I note that this may not be acceptable to some.  The alternative is
to add a temporary option to the cc_config.xml file that we can use to set
the floor while making tests.  Once we derive a viable number it can be
hard coded, or we can still let the participant decide ...  I could be flip
and suggest that this limit be coded in nanoseconds so that John can run
the CPU Scheduling routine as often as he desires, but minutes might be a
more reasonable unit ...

Items #8 and possibly #9 may become irrelevant with the above changes.  Of
course, making the system more stable may, in fact, uncover some sources of
instability and error I have not yet noticed because they are swamped by
the noise of the other factors.  I will note that a history of the tasks we
have started might be something we should consider maintaining.  Or to put
it another way: if a task was worthy enough to start processing, and we
preempted it for some other task before it had run for a full TSI, then it
should be run to its TSI, or to completion, when the next resource becomes
available, unless we have a true deadline peril.
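In rough terms, the preference I am describing is something like this;
PreemptedTask, run_time_since_switch and pick_resume_candidate are invented
names used purely to illustrate the idea:

    #include <string>
    #include <vector>

    // Hypothetical record of a preempted task that may still be owed run time.
    struct PreemptedTask {
        std::string name;
        double run_time_since_switch;   // seconds run since it was last scheduled in
        double deadline;                // seconds from now
    };

    // When a resource frees up, prefer a preempted task that has not yet had a
    // full TSI of run time (earliest deadline first among those); a null result
    // means no preempted task is owed time and normal selection applies.
    const PreemptedTask* pick_resume_candidate(
            const std::vector<PreemptedTask>& preempted, double tsi) {
        const PreemptedTask* best = nullptr;
        for (const auto& t : preempted) {
            if (t.run_time_since_switch >= tsi) continue;   // already had its slice
            if (!best || t.deadline < best->deadline) best = &t;
        }
        return best;
    }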


The last two, #10 and #11, are buried under the swamp.

Were I able to code this and compile it I would.  But as John has already
stated, when you don't want to hear what the problems are ... you are going
to pretend that the input is invalid ... so any test I do would be
automatically rejected anyway, meaning it would be pointless for me to even
try ...

After all, it *IS* so much faster to make up your mind first rather than to
investigate and find out what the situation might really be ...
