I can probably get some of them done over the weekend.
Yes, #8 was indeed supposed to have a tight deadline for the long project.
Tight deadline meaning wall time required / wall time until completion is
nearly 1.0.
I have some ready-made hosts from which I can grab the client_state.xml
for 9 and 10.
jm7
David Anderson <[email protected]>
10/29/2010 01:31 PM
To: <[email protected]>
cc: BOINC Developers Mailing List <[email protected]>
Subject: Re: [boinc_dev] proposed scheduling policy changes
That's an excellent list.
I'd extend 8) to say that the long-job project has low slack time
so the jobs run EDF.
We should create client_state.xml files for each of these cases;
do you want to take a stab at this?
-- David
On 29-Oct-2010 6:37 AM, [email protected] wrote:
> There are several cases we have to be careful that we get right:
>
> 1) Single CPU, single project.
> 2) Single CPU, single GPU, one project for each.
> 3) Single CPU, two projects with moderate run times.
> 4) Single CPU, single GPU with 2 projects for CPU only and one for GPU only.
> 5) Single CPU, single GPU with 1 project for CPU only and two for GPU only.
> 6) Single CPU, single GPU with 1 project for CPU only, 1 project for GPU only, and one for both.
> 7) Single CPU, single GPU with 1 project for CPU only, and one for both.
> 8) Single CPU, 2 projects, one that has extremely long run times and one that has short run times.
> 9) Kitchen sink - CPU only. 50+ projects attached to a single CPU.
> 10) Kitchen sink - CPU & GPU. 50+ projects with several for GPU, many for CPU, and a few for both.
>
> Is it even possible to break case #1?
>
> The current mechanism breaks the shared GPU cases, so we need to find a new one.
>
> Using anything involving the concept of "recent" will break #8 in the list
> above, at least for work fetch. We need to move the doc from Recent Average
> to Average for work fetch.
>
> Difference between work fetch and CPU scheduling: CPU scheduling has to
> be done with numbers that respond locally, because CPU scheduling runs
> hourly or more often while it can be weeks between scheduler contacts.
> Work fetch, on the other hand, may work well with numbers pulled from the
> project. Yes, they will be a bit stale, but we are aiming for a long term
> balance between projects. We might want to consider using completely
> different criteria for the two. I think that a combination of RAF with
> RAEC makes sense for CPU scheduling. For work fetch, what makes more
> sense is Average Credit (credit from the server, averaged over the time
> from project start on this computer to now), combined with a little
> information about what work is on the computer still to be done and what
> has been done but not yet reported; that should give us a pretty good
> design. This leaves the problem of projects that cannot be contacted for
> work or that are not supplying work long term. I believe we can fairly
> easily ignore projects that have no work on the host for a particular
> resource type when requesting work for that resource type.
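>
> A minimal sketch of that work-fetch idea, assuming "Average Credit" means
> total server-granted credit over time since attach, and using invented
> names (not the real client structures): a project's priority is how far
> its credit fraction, counting queued and unreported work, lags its share
> fraction.
>
>     #include <vector>
>
>     // Hypothetical sketch only; fields and functions are invented,
>     // not the actual BOINC client structures.
>     struct ProjectInfo {
>         double resource_share;       // user-set share
>         double server_total_credit;  // credit granted by the server so far
>         double pending_credit;       // est. credit for queued or unreported work
>         double attach_time;          // when the project was attached (seconds)
>     };
>
>     // Long-term credit rate, counting work on the host as already credited.
>     static double credit_rate(const ProjectInfo& p, double now) {
>         double dt = now - p.attach_time;
>         return dt > 0 ? (p.server_total_credit + p.pending_credit) / dt : 0;
>     }
>
>     // Positive = behind its share, so more deserving of new work.
>     double work_fetch_priority(const ProjectInfo& p,
>                                const std::vector<ProjectInfo>& all, double now) {
>         double share_sum = 0, rate_sum = 0;
>         for (const ProjectInfo& q : all) {
>             share_sum += q.resource_share;
>             rate_sum  += credit_rate(q, now);
>         }
>         if (share_sum <= 0 || rate_sum <= 0) return 0;
>         return p.resource_share / share_sum - credit_rate(p, now) / rate_sum;
>     }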
>
> Resource scheduling.
> The only requirements are a high priority mode, so that we don't break #9
> and #10 completely (and occasionally break other scenarios), and a normal
> mode of some sort. Almost any criterion will work for the round robin
> scheduler, since this is not the mechanism used for long term balancing
> of resource shares. It could be FIFO, the current Round Robin, or a Round
> Robin based on RAF. Note, however, that once a task has been downloaded
> to a host it needs to be worked on, and it has already been assigned to a
> resource type, so scheduling should be done per resource type; attempts
> to do short term scheduling with criteria that span resource types will
> tend to run off to infinity for one resource type or another.
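>
> A rough sketch of per-resource-type scheduling with a high priority (EDF)
> override, using made-up types and an arbitrary slack margin; the point is
> only that the two modes and the per-resource-type split are explicit.
>
>     #include <algorithm>
>     #include <vector>
>
>     // Hypothetical task description, not the real client structures.
>     struct Task {
>         int    resource_type;    // e.g. 0 = CPU, 1 = NVIDIA, 2 = ATI
>         double deadline;         // seconds from now
>         double remaining_time;   // estimated remaining run time (seconds)
>     };
>
>     // Pick the next task for one resource type.  If any task risks missing
>     // its deadline, fall back to earliest-deadline-first; otherwise rotate
>     // round robin over the runnable tasks (rr_cursor kept per type).
>     Task* next_task(std::vector<Task*>& runnable, int type, size_t& rr_cursor) {
>         std::vector<Task*> mine;
>         for (Task* t : runnable)
>             if (t->resource_type == type) mine.push_back(t);
>         if (mine.empty()) return nullptr;
>
>         auto at_risk = [](const Task* t) {
>             return t->remaining_time > 0.9 * t->deadline;  // arbitrary slack margin
>         };
>         if (std::any_of(mine.begin(), mine.end(), at_risk)) {
>             return *std::min_element(mine.begin(), mine.end(),
>                 [](const Task* a, const Task* b) { return a->deadline < b->deadline; });
>         }
>         return mine[rr_cursor++ % mine.size()];
>     }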
>
> Work Fetch
> This is where we need to try to meet the long term resource allocation.
> Unfortunately, using any concept that decays will break resource share
> usage for #8. We will almost certainly need to use long term averages of
> credit rather than short term averages. We can do a time weighted average
> with RAEC or RAF to achieve a better estimate of where the project ought
> to be right now.
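>
> One way to read "time weighted average" here, as a hedged sketch only:
> blend the non-decaying long-term credit rate with a decaying recent rate
> (RAEC or RAF style), weighting the recent term less as the project's
> history grows. The weighting constant below is illustrative, not tuned.
>
>     #include <cmath>
>
>     // Long-term rate: total credit over total attached time (no decay).
>     // Recent rate: exponentially decayed average (RAEC/RAF style).
>     // Blend: trust the recent rate early on, the long-term rate later.
>     double blended_rate(double total_credit, double attached_secs,
>                         double recent_rate, double half_life_secs) {
>         double long_term = attached_secs > 0 ? total_credit / attached_secs : 0;
>         // Weight on the recent term decays as history accumulates.
>         double w = std::exp(-attached_secs / (10.0 * half_life_secs));
>         return w * recent_rate + (1.0 - w) * long_term;
>     }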
>
> Work fetch cutoff.
> Any design that has no way of not requesting work from projects that have
> already used a resource too much will break #9 and #10. Using the
> projects that currently have work for a resource type as the benchmark,
> the rule can be: a project whose work fetch priority is lower than that
> of any project that already has work on the system for a particular
> resource type will not be asked for work for that resource, except to
> fill the queues up to min_queue.
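>
> A small sketch of that rule, with invented helper names, to make the
> comparison explicit: a project gets a full request for a resource only if
> no project that already has work on the host for that resource outranks it.
>
>     #include <vector>
>
>     // Per-project state for one resource type (hypothetical).
>     struct Proj {
>         double fetch_priority;         // hypothetical work-fetch priority
>         bool   has_work_for_resource;  // has tasks on the host for this resource
>     };
>
>     // True if 'candidate' may be asked for a full (beyond-min_queue) fetch.
>     bool eligible_for_full_fetch(const Proj& candidate,
>                                  const std::vector<Proj>& all) {
>         for (const Proj& p : all) {
>             if (p.has_work_for_resource &&
>                 p.fetch_priority > candidate.fetch_priority) {
>                 return false;   // a higher-priority project already has work here
>             }
>         }
>         return true;            // otherwise it only fills up to min_queue
>     }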
>
> Work Fetch reduction.
> It would be useful for the work fetch routine to have some information
> about which projects have applications for which resources. That would
> limit the number of spurious requests for CPU work from GPU-only projects
> (GPU Grid, for example), for NVIDIA work from projects that only support
> CPUs and ATI, and for GPU work of any kind from CPU-only projects, or
> from Macs when there are only Windows applications. The list would be
> fairly short. The XML would look like:
> <suppress_work_fetch><ATI>1</ATI><NVIDIA>1</NVIDIA></suppress_work_fetch>
> for a CPU-only project that supports all of the various OSes. This would
> let projects suppress work fetch requests for resources that are not in
> their stable of applications.
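>
> On the client side, honoring such a flag (the tag above is only proposed
> here, not an existing one) could be as simple as a per-project set of
> suppressed resources consulted before adding a resource to the request;
> a sketch with invented names:
>
>     #include <set>
>     #include <string>
>
>     struct ProjectCaps {
>         std::set<std::string> suppressed;  // e.g. {"ATI", "NVIDIA"} from the proposed tag
>     };
>
>     // Skip resources the project has declared it cannot use.
>     bool should_request(const ProjectCaps& caps, const std::string& resource) {
>         return caps.suppressed.count(resource) == 0;
>     }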
>
> Having a hysteresis in the work queue will help if there is a single
> project, but it has less effect as more projects are attached to a host,
> since the work fetch may well go to a project other than the one that is
> about to get a report. Hosts with multiple projects will also tend to
> make smaller requests for work. I am not saying that this is enough to
> throw out the idea.
>
> If the queue is full to the maximum (for all instances of all resources),
> do we stop requesting work even from otherwise eligible projects? If we
> do, that limits short term variability. If we don't, the hysteresis
> becomes either very large indeed or nonexistent, depending on what we do
> with min_queue. Do we wait for a resource instance to drop below
> min_queue before requesting work for that resource from any project? The
> answers to these two questions interact in interesting ways.
>
> Note that since there can be more than one project, the client should
> remember which direction it is going: each resource type should have a
> flag indicating whether it is filling or draining. If the client is
> filling the queues from multiple projects, it should keep asking
> different projects for work until it has reached the maximum queue
> allowed. If it is working off the current load, it should not ask for
> work until some instance of the resource has dropped to the minimum. My
> suggestion is to stop asking any project for work once the queue is
> filled to maximum, until some resource instance has dropped below
> minimum. On a dual CPU system, a single CPDN task does not fill both
> instances of the queue...
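>
> The filling/draining flag amounts to a small per-resource-type state
> machine; a sketch under the assumptions above (min_queue and max_queue as
> seconds of buffered work, tracked against the least-buffered instance):
>
>     // Hysteresis per resource type: keep fetching while FILLING until every
>     // instance reaches max_queue; then DRAIN until some instance drops
>     // below min_queue, and only then start FILLING again.
>     enum class FetchState { FILLING, DRAINING };
>
>     struct ResourceQueue {
>         FetchState state = FetchState::FILLING;
>         double min_buffered_secs;  // buffered work on the least-buffered instance
>         double max_queue;          // upper hysteresis bound (seconds)
>         double min_queue;          // lower hysteresis bound (seconds)
>     };
>
>     bool want_more_work(ResourceQueue& q) {
>         if (q.state == FetchState::FILLING && q.min_buffered_secs >= q.max_queue)
>             q.state = FetchState::DRAINING;
>         else if (q.state == FetchState::DRAINING && q.min_buffered_secs < q.min_queue)
>             q.state = FetchState::FILLING;
>         return q.state == FetchState::FILLING;
>     }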
>
> Determining the amount of work available for a resource instance.
> Using the round robin scheduler to make this determination has left my
> CPUs without work several times: tasks with little work left but
> approaching deadlines finished early in a disconnected period, leaving a
> single long running task (typically CPDN) to run on its own later with
> idle CPUs on the same machine. A pessimistic ("least friendly") estimate
> of how tasks pack onto CPUs is to assign the tasks with the shortest
> remaining run time to resources first. This can be done with a single
> pass through the tasks.
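>
> A sketch of that estimate, with invented names: sort by shortest
> remaining run time, then make one pass assigning each task to the
> currently least-loaded instance; the per-instance totals give a
> pessimistic picture of how long each instance stays busy.
>
>     #include <algorithm>
>     #include <vector>
>
>     // Returns estimated busy time (seconds) per instance of a resource,
>     // packing shortest-remaining-time tasks first.
>     std::vector<double> pack_shortest_first(std::vector<double> remaining_times,
>                                             int n_instances) {
>         std::sort(remaining_times.begin(), remaining_times.end());
>         std::vector<double> busy(n_instances, 0.0);
>         for (double t : remaining_times) {
>             // Assign to the instance that currently has the least work.
>             auto least = std::min_element(busy.begin(), busy.end());
>             *least += t;
>         }
>         return busy;   // smallest element = soonest-idle instance
>     }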
>
> jm7
>
>
>
> David Anderson <[email protected]>
> 10/28/2010 05:23 PM
> To: <[email protected]>
> cc: BOINC Developers Mailing List <[email protected]>
> Subject: Re: [boinc_dev] proposed scheduling policy changes
>
> John:
> I updated the proposal:
> http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen
> Please review.
> -- David
>
> On 28-Oct-2010 11:56 AM, [email protected] wrote:
>> Work Fetch Cutoff.
>>
>> Proposed that:
>>
>> A project is only eligible for a full work fetch if it has a higher work
>> fetch priority than all projects that currently have work on the client
>> (or there is no work on the client at all), AND the project does not have
>> a resource share of 0, AND the host is not running all resources of that
>> type in High Priority.
>>
>> Full work fetch is the maximum of (("extra work" + "connect every X") *
>> resource fraction) and (sum of estimated resource idle time before "extra
>> work" + "connect every X" from now), for a single resource type. The
>> idle-time estimate would be determined by packing tasks with the least
>> estimated runtime remaining first.
>>
>> If a project is not eligible for a full work fetch, or all of a device
>> type is running High Priority, the project would be limited to filling
>> sum (estimated resource idle time before "connect every X" from now) for
>> the resource type. (Fill the queue to the lowest acceptable point from
>> the least ineligible project.)
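>>
>> A sketch of the two request sizes above, with invented parameter names
>> ("extra work" and "connect every X" are the user preferences quoted in
>> the text; the idle-time estimates are assumed to come from elsewhere):
>>
>>     #include <algorithm>
>>     #include <numeric>
>>     #include <vector>
>>
>>     // Seconds of work to request for one resource type (full fetch).
>>     // idle_before[i] = estimated idle seconds of instance i before
>>     // (now + horizon), from a shortest-remaining-runtime-first packing.
>>     double full_fetch_secs(double extra_work, double connect_every_x,
>>                            double resource_fraction,
>>                            const std::vector<double>& idle_before) {
>>         double horizon  = extra_work + connect_every_x;
>>         double idle_sum = std::accumulate(idle_before.begin(), idle_before.end(), 0.0);
>>         return std::max(horizon * resource_fraction, idle_sum);
>>     }
>>
>>     // Reduced request for projects not eligible for a full fetch:
>>     // only cover estimated idle time within "connect every X".
>>     double reduced_fetch_secs(const std::vector<double>& idle_before_connect) {
>>         return std::accumulate(idle_before_connect.begin(),
>>                                idle_before_connect.end(), 0.0);
>>     }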
>>
>> This will limit the ability of projects to get perpetually deeper in debt
>> to the other projects on CPU time. It also sidesteps the issue where a
>> GPU has only one project available - the higher priority projects that
>> have no work available do not count against the work fetch for this
>> project. It also sidesteps the issue where a project is down for a long
>> time - it won't have work to run on the system, and will therefore not
>> count against the other projects filling the queue.
>>
>> jm7
>>
>>
>>
>> David Anderson <[email protected]>
>> 10/27/2010 02:41 PM
>> To: <[email protected]>
>> cc: BOINC Developers Mailing List <[email protected]>
>> Subject: Re: [boinc_dev] proposed scheduling policy changes
>>
>> On 27-Oct-2010 11:38 AM, [email protected] wrote:
>>
>>>
>>> The proposed work fetch policy will also mean that more work will be
>>> fetched from the projects where the machine is least effective,
>>> assuming FLOP counting is used to grant credits.
>>
>> That's a good point.
>> I'm now leaning towards client-estimated credit
>> rather than actual credit as a basis.
>> (Also because of the issues that Richard raised)
>>
>> -- David
>>
>>
>>
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.