There are several cases we have to be careful that we get right:
1) Single CPU, single project.
2) Single CPU, single GPU, one project for each.
3) Single CPU, two projects with moderate run times.
4) Single CPU, single GPU with two projects for CPU only and one for GPU only.
5) Single CPU, single GPU with one project for CPU only and two for GPU only.
6) Single CPU, single GPU with one project for CPU only, one project for GPU only, and one for both.
7) Single CPU, single GPU with one project for CPU only, and one for both.
8) Single CPU, two projects, one that has extremely long run times and one that has short run times.
9) Kitchen sink - CPU only. 50+ projects attached to a single CPU.
10) Kitchen sink - CPU & GPU. 50+ projects, with several for GPU, many for CPU, and a few for both.
Is it even possible to break case #1?
The current mechanism breaks the shared GPU cases, so we need to find a
new one.
Using anything involving the concept of "recent" will break #8 in the
list above, at least for work fetch. We need to move from Recent
Average to a long-term Average for work fetch.
Difference between work fetch and CPU scheduling. CPU scheduling has to
be done with numbers that respond locally, because CPU scheduling
happens hourly or more frequently, while it can be weeks between
scheduler contacts. Work fetch, on the other hand, may work well with
numbers pulled from the project. Yes, they will be a bit stale, but we
are aiming for a long-term balance between projects. We might want to
consider using completely different criteria for the two. I think that
a combination of RAF with RAEC makes sense for CPU scheduling. What
makes more sense for work fetch is Average Credit (credit from the
server, averaged over the interval from project start on this computer
to now). Combine this with a little information about what work is on
the computer still to be done, and what is on the computer but not yet
reported, and we should have a pretty good design. This leaves the
problem of projects that cannot be contacted for work, or that are not
supplying work, long term. I believe we can relatively easily ignore
projects that do not have work on the host for a particular resource
type when requesting work for that resource type.
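
As a rough sketch of what I mean (the names here are made up, not the
actual client structures), the long-term average could be computed like
this, counting work already on the host as if it had been granted:

#include <ctime>

// Hypothetical per-project state; not the real client types.
struct Project {
    double total_credit;        // credit granted by the server so far
    time_t attach_time;         // when the project was attached on this host
    double queued_credit_est;   // estimated credit for work on hand, not yet done
    double pending_credit_est;  // estimated credit for completed, unreported work
};

// Long-term average credit per day since attach. Work already on the
// host (queued or unreported) counts as if granted, so freshly fetched
// work counts against the project immediately.
double long_term_average(const Project& p, time_t now) {
    double days = difftime(now, p.attach_time) / 86400.0;
    if (days <= 0) return 0;
    return (p.total_credit + p.queued_credit_est + p.pending_credit_est) / days;
}

A fetch pass for a resource type would then simply skip any project
with no work on the host for that resource, which also covers projects
that are down or out of work long term.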
Resource scheduling.
The only requirements are a high-priority mode, so that we don't break
#9 and #10 completely (as well as occasionally breaking other
scenarios), and a normal mode of some sort. Almost any criteria will
work for the round robin scheduler, as this is not the mechanism used
for long-term balancing of resource shares. This could be FIFO, the
current Round Robin, or a Round Robin based on RAF. Note, however,
that once a task has been downloaded to a host it needs to be worked
on, and it has already been assigned to a resource type, so scheduling
should be done per resource type; attempts to do short-term scheduling
based on criteria across resource types will tend to run off to
infinity for one resource type or another.
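
For concreteness, a minimal per-resource-type version might look like
the following (all names assumed; the real client does much more).
Deadline-pressured tasks run first, everything else in plain queue
order, and resource types never compete with each other:

#include <vector>

enum ResourceType { CPU, NVIDIA_GPU, ATI_GPU };

struct Task {
    ResourceType rsc;       // fixed when the task was downloaded
    bool deadline_trouble;  // flagged by the deadline check
};

// One run list per resource type; criteria across types are never mixed.
std::vector<Task*> pick_tasks(const std::vector<Task*>& runnable,
                              ResourceType rsc, size_t ninstances) {
    std::vector<Task*> to_run;
    // High-priority mode: tasks in deadline trouble go first.
    for (Task* t : runnable)
        if (t->rsc == rsc && t->deadline_trouble && to_run.size() < ninstances)
            to_run.push_back(t);
    // Normal mode: plain FIFO here; round robin or RAF order would
    // work just as well, since this is not the long-term balancer.
    for (Task* t : runnable)
        if (t->rsc == rsc && !t->deadline_trouble && to_run.size() < ninstances)
            to_run.push_back(t);
    return to_run;
}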
Work Fetch.
This is where we need to try to meet the long-term resource allocation.
Unfortunately, using any concept that decays will break resource share
usage for #8. We will almost certainly need to use long-term averages
of credit rather than short-term averages. We can do a time-weighted
average with RAEC or RAF to achieve a better estimate of where the
project ought to be right now.
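
One way to read that time-weighted average (the weighting here is my
assumption, not a settled choice): let the long-term figure dominate
the longer the project has been attached, and lean on RAEC or RAF
while the history is still short.

// Assumed blend: "recent_avg" is RAEC or RAF, "long_term_avg" is the
// credit-since-attach average from the earlier sketch. The 30-day
// constant is an arbitrary tuning knob, not a proposal.
double work_fetch_estimate(double long_term_avg, double recent_avg,
                           double days_attached) {
    double alpha = days_attached / (days_attached + 30.0);
    return alpha * long_term_avg + (1.0 - alpha) * recent_avg;
}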
Work fetch cutoff.
Any design that does not include some method of not requesting work
from projects that have already used a resource too much will break #9
and #10. Using the projects that have work available for a resource
type as the benchmark, we can have the rule that any project with a
lower work fetch priority than some project that has work on the
system for a particular resource type will not be asked to supply work
for that resource, except to fill the queues to min_queue.
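
As a check, that rule might look like this (sketch only, with assumed
field names):

#include <vector>

// Minimal per-project view for one resource type (assumed names).
struct ProjectView {
    bool has_work_on_host;  // tasks for this resource currently on the host
    double fetch_priority;  // work fetch priority for this resource type
};

// A project may make a full request only if no project that already
// has work here outranks it; otherwise it may fill only to min_queue.
bool eligible_for_full_fetch(const ProjectView& p,
                             const std::vector<ProjectView>& others) {
    for (const ProjectView& o : others)
        if (o.has_work_on_host && o.fetch_priority > p.fetch_priority)
            return false;
    return true;
}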
Work Fetch reduction.
It would be useful for the work fetch routine to have some information
about which projects have applications for which resources. This would
limit the number of spurious work requests: for CPU work from GPU-only
projects (GPUGRID, for example), for NVIDIA work from projects that
only have CPU and ATI applications, and for GPU work of any kind from
CPU-only projects, or on Macs when there are only Windows applications.
The list would be fairly short. The XML would look like:
<suppress_work_fetch><ATI>1</ATI><NVIDIA>1</NVIDIA></suppress_work_fetch>
for a CPU-only project that had support for all of the various OSs.
This would allow projects to suppress work fetch requests for resources
that are not in their stable of applications.
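
On the client side this would be a trivial check before building a
request (the flag names mirror the proposed XML; everything here is
hypothetical):

// Flags parsed from the proposed <suppress_work_fetch> element.
struct SuppressFlags {
    bool nvidia = false;
    bool ati = false;
};

enum Resource { RSC_CPU, RSC_NVIDIA, RSC_ATI };

bool may_request_work(const SuppressFlags& f, Resource r) {
    if (r == RSC_NVIDIA && f.nvidia) return false;
    if (r == RSC_ATI && f.ati) return false;
    return true;  // no flag means the project may be asked as usual
}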
Having a hysteresis in the work queue will help if there is a single
project, but it starts to have less effect with more projects attached
to a host, as the work fetch may well go to a project other than the
one that is about to get a report. Hosts with multiple projects will
also tend to make smaller requests for work. I am not saying that this
is enough to throw out the idea.

If the queue is full to the maximum (for all instances of all
resources), do we stop requesting work, even from otherwise eligible
projects? If we do, this would limit short-term variability. If we
don't, this would make the hysteresis either very large indeed or
nonexistent, depending on what we do with min_queue. Do we wait for a
resource instance to drop below min_queue before requesting work for
that resource from any project? The answer to this question interacts
with the answer to the previous question in interesting ways.

Note that since there can be more than one project, the client should
remember which direction it is going: each resource type should have a
flag that indicates whether it is filling or draining. If the client
is attempting to fill the queues from multiple projects, it should keep
asking different projects for work until it has reached the maximum
queue allowed. If it is working off the current load, it should not
ask for work until some instance of the resource has reached the
minimum. My suggestion would be to stop asking for work from any
project when the queue is filled to maximum, and not start again until
some resource instance has dropped below minimum. On a dual-CPU
system, a single CPDN task does not fill both instances of the queue...
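
That direction flag is a classic two-threshold hysteresis, something
like this (assumed names; buffer measured as seconds of work per
instance):

// Per-resource-type queue state with a fill/drain direction flag.
struct QueueState {
    bool filling = true;  // true: asking projects for work; false: draining
    double min_queue;     // lower bound, seconds of buffered work
    double max_queue;     // upper bound
};

// Drive the flag from the least-buffered instance of the resource,
// since one long task on one instance doesn't cover the others.
bool should_request_work(QueueState& s, double least_buffered_instance) {
    if (s.filling && least_buffered_instance >= s.max_queue)
        s.filling = false;  // full everywhere: stop asking any project
    else if (!s.filling && least_buffered_instance < s.min_queue)
        s.filling = true;   // some instance ran low: start asking again
    return s.filling;
}

The dual-CPU CPDN case falls out naturally: a single long task raises
only one instance's buffer, so the least-buffered instance keeps the
flag set to filling.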
Determining the amount of work available for a resource instance.
Using the round robin scheduler to make this determination has left my
CPUs without work several times: tasks with little work left but
approaching deadlines were finished early in a disconnected period,
leaving the single long-running task (typically CPDN) to run on its
own later, with idle CPUs on the same machine. An estimate of the
least friendly packing of tasks into CPUs is to assign the
shortest-remaining-run-time tasks to resources first. This can be done
with a single pass through the tasks.
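
A sketch of that pass (assumed types): sort by remaining run time,
shortest first, and always hand the next task to the instance that
frees up soonest; the instance that ends up least loaded tells you how
soon a CPU goes idle.

#include <algorithm>
#include <vector>

// Pessimistic estimate of buffered work: shortest remaining run time
// first, each task going to the instance that becomes free soonest.
double time_until_first_idle(std::vector<double> remaining, int ninstances) {
    std::sort(remaining.begin(), remaining.end());  // shortest first
    std::vector<double> busy(ninstances, 0.0);
    for (double r : remaining)
        *std::min_element(busy.begin(), busy.end()) += r;
    // The least-loaded instance is the first to run dry.
    return *std::min_element(busy.begin(), busy.end());
}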
jm7
David Anderson <[email protected]>
10/28/2010 05:23 PM
To: <[email protected]>
cc: BOINC Developers Mailing List <[email protected]>
Subject: Re: [boinc_dev] proposed scheduling policy changes
John:
I updated the proposal:
http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen
Please review.
-- David
On 28-Oct-2010 11:56 AM, [email protected] wrote:
> Work Fetch Cutoff.
>
> Proposed that:
>
> A project is only eligible for a full work fetch if it has a higher
> work fetch priority than all projects that currently have work on the
> client, or there is no work on the client at all, AND the project does
> not have a resource share of 0, AND the host is not running all
> resources of that type in High Priority.
>
> Full work fetch is the maximum of (("extra work" + "connect every X")
> * resource fraction, sum (estimated resource idle time before "extra
> work" + "connect every X" from now)) for a single resource type. This
> would be determined from a least-estimated-runtime-remaining-first
> pass.
>
> If a project is not eligible for full work fetch, or all of a device
> type is running High Priority, the project would be limited to filling
> sum (estimated resource idle time before "connect every X" from now)
> for the resource type. (Fill the queue to the lowest acceptable point
> from the least ineligible project.)
>
> This will limit the ability of projects to get perpetually deeper in
> debt to the other projects on CPU time. It also sidesteps the issue
> where a GPU has only one project available - the higher-priority
> projects that have no work available do not count against the work
> fetch for this project. It also sidesteps the issue where a project
> is down for a long time - it won't have work to run on the system,
> and will therefore not count against the other projects filling the
> queue.
>
> jm7
>
> David Anderson <[email protected]>
> 10/27/2010 02:41 PM
> To: <[email protected]>
> cc: BOINC Developers Mailing List <[email protected]>
> Subject: Re: [boinc_dev] proposed scheduling policy changes
>
> On 27-Oct-2010 11:38 AM, [email protected] wrote:
>
>>
>> The proposed work fetch policy will also mean that more work will be
>> fetched from the projects where the machine is least effective, assuming
>> FLOP counting is used to grant credits.
>
> That's a good point.
> I'm now leaning towards client-estimated credit
> rather than actual credit as a basis.
> (Also because of the issues that Richard raised)
>
> -- David