I can probably get some of them done over the weekend.
Yes, #8 was indeed supposed to have a tight deadline for the long project.
Tight deadline meaning wall time required / wall time until completion is
nearly 1.0.
I have some ready-made hosts from which I can grab the client_state.xml
for 9 and 10.
jm7
David Anderson <[email protected]>
10/29/2010 01:31 PM
To: <[email protected]>
cc: BOINC Developers Mailing List <[email protected]>
Subject: Re: [boinc_dev] proposed scheduling policy changes
That's an excellent list.
I'd extend 8) to say that the long-job project has low slack time
so the jobs run EDF.
We should create client_state.xml files for each of these cases;
do you want to take a stab at this?
-- David
On 29-Oct-2010 6:37 AM, [email protected] wrote:
> There are several cases we have to be careful that we get right:
>
> 1) Single CPU, single project.
> 2) Single CPU, single GPU, one project for each.
> 3) Single CPU, two projects with moderate run times.
> 4) Single CPU, single GPU with 2 projects for CPU only and one for GPU only.
> 5) Single CPU, single GPU with 1 project for CPU only and two for GPU only.
> 6) Single CPU, single GPU with 1 project for CPU only, 1 project for GPU only, and one for both.
> 7) Single CPU, single GPU with 1 project for CPU only, and one for both.
> 8) Single CPU, 2 projects, one that has extremely long run times and one that has short run times.
> 9) Kitchen sink - CPU only. 50+ projects attached to a single CPU.
> 10) Kitchen sink - CPU & GPU. 50+ projects with several for GPU, many for CPU, and a few for both.
>
> Is it even possible to break case #1?
>
> The current mechanism breaks the shared GPU cases, so we need to find a new one.
>
> Using anything involving the concept of "recent" will break #8 in the list
> above, at least for work fetch. We need to move the doc from Recent Average
> to Average for work fetch.
>
> Difference between work fetch and CPU scheduling: CPU scheduling has to
> be done with numbers that respond locally, because CPU scheduling runs
> hourly or more often while it can be weeks between scheduler contacts.
> Work fetch, on the other hand, may work well with numbers pulled from the
> project. Yes, they will be a bit stale, but we are aiming for a long term
> balance between projects. We might want to consider using completely
> different criteria for the two. I think that a combination of RAF with
> RAEC makes sense for CPU scheduling. For work fetch, what makes more
> sense is Average Credit (credit from the server, averaged over the time
> from project start on this computer to now), combined with a little
> information about what work is on the computer still to be done and what
> has been done but not yet reported; that should give us a pretty good
> design. This leaves the problem of projects that cannot be contacted for
> work or that are not supplying work long term. I believe we can fairly
> easily ignore projects that have no work on the host for a particular
> resource type when requesting work for that resource type.
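>
> A minimal sketch of that work-fetch idea, assuming "Average Credit" means
> total server-granted credit over time since attach, and using invented
> names (not the real client structures): a project's priority is how far
> its credit fraction, counting queued and unreported work, lags its share
> fraction.
>
>     #include <vector>
>
>     // Hypothetical sketch only; fields and functions are invented,
>     // not the actual BOINC client structures.
>     struct ProjectInfo {
>         double resource_share;       // user-set share
>         double server_total_credit;  // credit granted by the server so far
>         double pending_credit;       // est. credit for queued or unreported work
>         double attach_time;          // when the project was attached (seconds)
>     };
>
>     // Long-term credit rate, counting work on the host as already credited.
>     static double credit_rate(const ProjectInfo& p, double now) {
>         double dt = now - p.attach_time;
>         return dt > 0 ? (p.server_total_credit + p.pending_credit) / dt : 0;
>     }
>
>     // Positive = behind its share, so more deserving of new work.
>     double work_fetch_priority(const ProjectInfo& p,
>                                const std::vector<ProjectInfo>& all, double now) {
>         double share_sum = 0, rate_sum = 0;
>         for (const ProjectInfo& q : all) {
>             share_sum += q.resource_share;
>             rate_sum  += credit_rate(q, now);
>         }
>         if (share_sum <= 0 || rate_sum <= 0) return 0;
>         return p.resource_share / share_sum - credit_rate(p, now) / rate_sum;
>     }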
>
> Resource scheduling.
> The only requirements are a high priority mode, so that we don't break #9
> and #10 completely (and occasionally break other scenarios), and a normal
> mode of some sort. Almost any criterion will work for the round robin
> scheduler, since this is not the mechanism used for long term balancing
> of resource shares. It could be FIFO, the current Round Robin, or a Round
> Robin based on RAF. Note, however, that once a task has been downloaded
> to a host it needs to be worked on, and it has already been assigned to a
> resource type, so scheduling should be done per resource type; attempts
> to do short term scheduling with criteria that span resource types will
> tend to run off to infinity for one resource type or another.
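>
> A rough sketch of per-resource-type scheduling with a high priority (EDF)
> override, using made-up types and an arbitrary slack margin; the point is
> only that the two modes and the per-resource-type split are explicit.
>
>     #include <algorithm>
>     #include <vector>
>
>     // Hypothetical task description, not the real client structures.
>     struct Task {
>         int    resource_type;    // e.g. 0 = CPU, 1 = NVIDIA, 2 = ATI
>         double deadline;         // seconds from now
>         double remaining_time;   // estimated remaining run time (seconds)
>     };
>
>     // Pick the next task for one resource type.  If any task risks missing
>     // its deadline, fall back to earliest-deadline-first; otherwise rotate
>     // round robin over the runnable tasks (rr_cursor kept per type).
>     Task* next_task(std::vector<Task*>& runnable, int type, size_t& rr_cursor) {
>         std::vector<Task*> mine;
>         for (Task* t : runnable)
>             if (t->resource_type == type) mine.push_back(t);
>         if (mine.empty()) return nullptr;
>
>         auto at_risk = [](const Task* t) {
>             return t->remaining_time > 0.9 * t->deadline;  // arbitrary slack margin
>         };
>         if (std::any_of(mine.begin(), mine.end(), at_risk)) {
>             return *std::min_element(mine.begin(), mine.end(),
>                 [](const Task* a, const Task* b) { return a->deadline < b->deadline; });
>         }
>         return mine[rr_cursor++ % mine.size()];
>     }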
>
> Work Fetch
> This is where we need to try to meet the long term resource allocation.
> Unfortunately, using any concept that decays will break resource share
> usage for #8. We will almost certainly need to use long term averages of
> credit rather than short term averages. We can do a time weighted average
> with RAEC or RAF to achieve a better estimate of where the project ought
> to be right now.
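>
> One way to read "time weighted average" here, as a hedged sketch only:
> blend the non-decaying long-term credit rate with a decaying recent rate
> (RAEC or RAF style), weighting the recent term less as the project's
> history grows. The weighting constant below is illustrative, not tuned.
>
>     #include <cmath>
>
>     // Long-term rate: total credit over total attached time (no decay).
>     // Recent rate: exponentially decayed average (RAEC/RAF style).
>     // Blend: trust the recent rate early on, the long-term rate later.
>     double blended_rate(double total_credit, double attached_secs,
>                         double recent_rate, double half_life_secs) {
>         double long_term = attached_secs > 0 ? total_credit / attached_secs : 0;
>         // Weight on the recent term decays as history accumulates.
>         double w = std::exp(-attached_secs / (10.0 * half_life_secs));
>         return w * recent_rate + (1.0 - w) * long_term;
>     }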
>
> Work fetch cutoff.
> Any design that has no way of not requesting work from projects that have
> already used a resource too much will break #9 and #10. Using the
> projects that currently have work for a resource type as the benchmark,
> the rule can be: a project whose work fetch priority is lower than that
> of any project that already has work on the system for a particular
> resource type will not be asked for work for that resource, except to
> fill the queues up to min_queue.
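>
> A small sketch of that rule, with invented helper names, to make the
> comparison explicit: a project gets a full request for a resource only if
> no project that already has work on the host for that resource outranks it.
>
>     #include <vector>
>
>     // Per-project state for one resource type (hypothetical).
>     struct Proj {
>         double fetch_priority;         // hypothetical work-fetch priority
>         bool   has_work_for_resource;  // has tasks on the host for this resource
>     };
>
>     // True if 'candidate' may be asked for a full (beyond-min_queue) fetch.
>     bool eligible_for_full_fetch(const Proj& candidate,
>                                  const std::vector<Proj>& all) {
>         for (const Proj& p : all) {
>             if (p.has_work_for_resource &&
>                 p.fetch_priority > candidate.fetch_priority) {
>                 return false;   // a higher-priority project already has work here
>             }
>         }
>         return true;            // otherwise it only fills up to min_queue
>     }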
>
> Work Fetch reduction.
> It would be useful for the work fetch routine to have some information
> about which projects have applications for which resources. That would
> limit the number of spurious requests for CPU work from GPU-only projects
> (GPU Grid, for example), for NVIDIA work from projects that only support
> CPUs and ATI, and for GPU work of any kind from CPU-only projects, or
> from Macs when there are only Windows applications. The list would be
> fairly short. The XML would look like:
> <suppress_work_fetch><ATI>1</ATI><NVIDIA>1</NVIDIA></suppress_work_fetch>
> for a CPU-only project that supports all of the various OSes. This would
> let projects suppress work fetch requests for resources that are not in
> their stable of applications.
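>
> On the client side, honoring such a flag (the tag above is only proposed
> here, not an existing one) could be as simple as a per-project set of
> suppressed resources consulted before adding a resource to the request;
> a sketch with invented names:
>
>     #include <set>
>     #include <string>
>
>     struct ProjectCaps {
>         std::set<std::string> suppressed;  // e.g. {"ATI", "NVIDIA"} from the proposed tag
>     };
>
>     // Skip resources the project has declared it cannot use.
>     bool should_request(const ProjectCaps& caps, const std::string& resource) {
>         return caps.suppressed.count(resource) == 0;
>     }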
>
> Having a hysteresis in the work queue will help if there is a single
> project, but it has less effect as more projects are attached to a host,
> since the work fetch may well go to a project other than the one that is
> about to get a report. Hosts with multiple projects will also tend to
> make smaller requests for work. I am not saying that this is enough to
> throw out the idea.
>
> If the queue is full to the maximum (for all instances of all resources),
> do we stop requesting work even from otherwise eligible projects? If we
> do, that limits short term variability. If we don't, the hysteresis
> becomes either very large indeed or nonexistent, depending on what we do
> with min_queue. Do we wait for a resource instance to drop below
> min_queue before requesting work for that resource from any project? The
> answers to these two questions interact in interesting ways.
>
> Note that since there can be more than one project, the client should
> remember which direction it is going: each resource type should have a
> flag indicating whether it is filling or draining. If the client is
> filling the queues from multiple projects, it should keep asking
> different projects for work until it has reached the maximum queue
> allowed. If it is working off the current load, it should not ask for
> work until some instance of the resource has dropped to the minimum. My
> suggestion is to stop asking any project for work once the queue is
> filled to maximum, until some resource instance has dropped below
> minimum. On a dual CPU system, a single CPDN task does not fill both
> instances of the queue...
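>
> The filling/draining flag amounts to a small per-resource-type state
> machine; a sketch under the assumptions above (min_queue and max_queue as
> seconds of buffered work, tracked against the least-buffered instance):
>
>     // Hysteresis per resource type: keep fetching while FILLING until every
>     // instance reaches max_queue; then DRAIN until some instance drops
>     // below min_queue, and only then start FILLING again.
>     enum class FetchState { FILLING, DRAINING };
>
>     struct ResourceQueue {
>         FetchState state = FetchState::FILLING;
>         double min_buffered_secs;  // buffered work on the least-buffered instance
>         double max_queue;          // upper hysteresis bound (seconds)
>         double min_queue;          // lower hysteresis bound (seconds)
>     };
>
>     bool want_more_work(ResourceQueue& q) {
>         if (q.state == FetchState::FILLING && q.min_buffered_secs >= q.max_queue)
>             q.state = FetchState::DRAINING;
>         else if (q.state == FetchState::DRAINING && q.min_buffered_secs < q.min_queue)
>             q.state = FetchState::FILLING;
>         return q.state == FetchState::FILLING;
>     }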
>
> Determining the amount of work available for a resource instance.
> Using the round robin scheduler to make this determination has left my
> CPUs without work several times: tasks with little work left but
> approaching deadlines finished early in a disconnected period, leaving a
> single long running task (typically CPDN) to run on its own later with
> idle CPUs on the same machine. A pessimistic ("least friendly") estimate
> of how tasks pack onto CPUs is to assign the tasks with the shortest
> remaining run time to resources first. This can be done with a single
> pass through the tasks.
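>
> A sketch of that estimate, with invented names: sort by shortest
> remaining run time, then make one pass assigning each task to the
> currently least-loaded instance; the per-instance totals give a
> pessimistic picture of how long each instance stays busy.
>
>     #include <algorithm>
>     #include <vector>
>
>     // Returns estimated busy time (seconds) per instance of a resource,
>     // packing shortest-remaining-time tasks first.
>     std::vector<double> pack_shortest_first(std::vector<double> remaining_times,
>                                             int n_instances) {
>         std::sort(remaining_times.begin(), remaining_times.end());
>         std::vector<double> busy(n_instances, 0.0);
>         for (double t : remaining_times) {
>             // Assign to the instance that currently has the least work.
>             auto least = std::min_element(busy.begin(), busy.end());
>             *least += t;
>         }
>         return busy;   // smallest element = soonest-idle instance
>     }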
>
> jm7
>
>
>
> David Anderson <[email protected]>
> 10/28/2010 05:23 PM
> To: <[email protected]>
> cc: BOINC Developers Mailing List <[email protected]>
> Subject: Re: [boinc_dev] proposed scheduling policy changes
>
> John:
> I updated the proposal:
> http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen
> Please review.
> -- David
>
> On 28-Oct-2010 11:56 AM, [email protected] wrote:
>> Work Fetch Cutoff.
>>
>> Proposed that:
>>
>> A project is only eligible for a full work fetch if it has a higher work
>> fetch priority than all projects that currently have work on the client
>> (or there is no work on the client at all), AND the project does not have
>> a resource share of 0, AND the host is not running all resources of that
>> type in High Priority.
>>
>> Full work fetch is the maximum of (("extra work" + "connect every X") *
>> resource fraction) and (sum of estimated resource idle time before "extra
>> work" + "connect every X" from now), for a single resource type. The
>> idle-time estimate would be determined by packing tasks with the least
>> estimated runtime remaining first.
>>
>> If a project is not eligible for a full work fetch, or all of a device
>> type is running High Priority, the project would be limited to filling
>> sum (estimated resource idle time before "connect every X" from now) for
>> the resource type. (Fill the queue to the lowest acceptable point from
>> the least ineligible project.)
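>>
>> A sketch of the two request sizes above, with invented parameter names
>> ("extra work" and "connect every X" are the user preferences quoted in
>> the text; the idle-time estimates are assumed to come from elsewhere):
>>
>>     #include <algorithm>
>>     #include <numeric>
>>     #include <vector>
>>
>>     // Seconds of work to request for one resource type (full fetch).
>>     // idle_before[i] = estimated idle seconds of instance i before
>>     // (now + horizon), from a shortest-remaining-runtime-first packing.
>>     double full_fetch_secs(double extra_work, double connect_every_x,
>>                            double resource_fraction,
>>                            const std::vector<double>& idle_before) {
>>         double horizon  = extra_work + connect_every_x;
>>         double idle_sum = std::accumulate(idle_before.begin(), idle_before.end(), 0.0);
>>         return std::max(horizon * resource_fraction, idle_sum);
>>     }
>>
>>     // Reduced request for projects not eligible for a full fetch:
>>     // only cover estimated idle time within "connect every X".
>>     double reduced_fetch_secs(const std::vector<double>& idle_before_connect) {
>>         return std::accumulate(idle_before_connect.begin(),
>>                                idle_before_connect.end(), 0.0);
>>     }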
>>
>> This will limit the ability of projects to get perpetually deeper in debt
>> to the other projects on CPU time. It also sidesteps the issue where a
>> GPU has only one project available - the higher priority projects that
>> have no work available do not count against the work fetch for this
>> project. It also sidesteps the issue where a project is down for a long
>> time - it won't have work to run on the system, and will therefore not
>> count against the other projects filling the queue.
>>
>> jm7
>>
>>
>>
>> David Anderson <[email protected]>
>> 10/27/2010 02:41 PM
>> To: <[email protected]>
>> cc: BOINC Developers Mailing List <[email protected]>
>> Subject: Re: [boinc_dev] proposed scheduling policy changes
>>
>> On 27-Oct-2010 11:38 AM, [email protected] wrote:
>>
>>>
>>> The proposed work fetch policy will also mean that more work will be
>>> fetched from the projects where the machine is least effective,
>>> assuming FLOP counting is used to grant credits.
>>
>> That's a good point.
>> I'm now leaning towards client-estimated credit
>> rather than actual credit as a basis.
>> (Also because of the issues that Richard raised)
>>
>> -- David
>>
>>
>>
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.