That's an excellent list. I'd extend 8) to say that the long-job project has low slack time, so the jobs run EDF (earliest deadline first).
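[A minimal sketch of that EDF condition, assuming all tasks sit on a single CPU and are ready now; illustrative C++, not the client's actual rr_sim code.]

    #include <algorithm>
    #include <vector>

    struct Task {
        double remaining;   // estimated runtime left, seconds
        double deadline;    // deadline, seconds from now
    };

    // Sort by deadline (EDF order) and simulate running the tasks back to
    // back on one CPU. If any cumulative finish time passes its deadline,
    // the set has negative slack: the client must run EDF (high priority)
    // and may still miss deadlines.
    bool edf_feasible(std::vector<Task> tasks) {
        std::sort(tasks.begin(), tasks.end(),
                  [](const Task& a, const Task& b) { return a.deadline < b.deadline; });
        double clock = 0;
        for (const Task& t : tasks) {
            clock += t.remaining;
            if (clock > t.deadline) return false;   // miss even under EDF
        }
        return true;
    }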
We should create client_state.xml files for each of these cases; do you want to take a stab at this?
-- David

On 29-Oct-2010 6:37 AM, [email protected] wrote:
> There are several cases we have to be careful that we get right:
>
> 1) Single CPU, single project.
> 2) Single CPU, single GPU, one project for each.
> 3) Single CPU, two projects with moderate run times.
> 4) Single CPU, single GPU with 2 projects for CPU only and one for GPU only.
> 5) Single CPU, single GPU with 1 project for CPU only and two for GPU only.
> 6) Single CPU, single GPU with 1 project for CPU only, 1 project for GPU only, and one for both.
> 7) Single CPU, single GPU with 1 project for CPU only, and one for both.
> 8) Single CPU, 2 projects, one that has extremely long run times and one that has short run times.
> 9) Kitchen sink - CPU only. 50+ projects attached to a single CPU.
> 10) Kitchen sink - CPU & GPU. 50+ projects, with several for GPU, many for CPU, and a few for both.
>
> Is it even possible to break case #1?
>
> The current mechanism breaks the shared GPU cases, so we need to find a new one.
>
> Using anything involving the concept of "recent" will break #8 in the list above, at least for work fetch. We need to move the doc from Recent Average to Average for work fetch.
>
> Difference between work fetch and CPU scheduling: CPU scheduling has to be done with numbers that respond locally, because CPU scheduling happens hourly or more often, while it can be weeks between scheduler contacts. Work fetch, on the other hand, may work well with numbers pulled from the project. Yes, they will be a bit stale, but we are aiming for a long term balance between projects. We might want to consider using completely different criteria for the two. I think that a combination of RAF with RAEC makes sense for CPU scheduling. What makes more sense for work fetch is Average Credit (credit from the server, averaged from project start on this computer to now). Combine this with a little information about what is on the computer still to be done, and what is done but not yet reported, and we should have a pretty good design. This leaves the problem of projects that cannot be contacted for work, or that are not supplying work, long term. I believe we can fairly easily ignore projects that have no work on the host for a particular resource type when requesting work for that resource type.
>
> Resource scheduling.
> The only requirements are a high priority mode, so that we don't break #9 and #10 completely (and occasionally break other scenarios), and a normal mode of some sort. Almost any criterion will work for the round robin scheduler, as this is not the mechanism used for long term balancing of resource shares. It could be FIFO, the current Round Robin, or a Round Robin based on RAF. Note, however, that once a task has been downloaded to a host it needs to be worked on, and it has already been assigned to a resource type, so scheduling should be done per resource type; attempts to do short term scheduling on criteria that span resource types will tend to run off to infinity for one resource type or another.
>
> Work Fetch.
> This is where we need to try to meet the long term resource allocation. Unfortunately, using any concept that decays will break resource share usage for #8. We will almost certainly need to use long term averages of credit rather than short term averages. We can do a time weighted average with RAEC or RAF to achieve a better estimate of where the project ought to be right now.
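[To make the decay problem concrete, a sketch of the two kinds of average; names and half-life are hypothetical, not the client's actual accounting. A decayed "recent" average collapses toward zero between the rare credit grants of a long-job project, while an average over the whole attachment period does not.]

    #include <cmath>

    // Example half-life of one week; purely illustrative.
    const double HALF_LIFE = 7 * 86400.0;

    // Exponentially decayed "recent" average rate (credit per second).
    // dt = seconds since the last credit grant (dt > 0). For case #8, dt
    // can be weeks, so decay is ~0 and the history is forgotten: the
    // average swings wildly with each grant.
    double update_recent_avg(double avg, double new_credit, double dt) {
        double decay = std::exp(-dt * std::log(2.0) / HALF_LIFE);
        return avg * decay + (new_credit / dt) * (1.0 - decay);
    }

    // Long-term average rate: total credit over total time attached.
    // Insensitive to how lumpy the grants are, so it suits long term
    // resource share balancing.
    double long_term_avg(double total_credit, double now, double project_start) {
        return total_credit / (now - project_start);
    }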
> Work fetch cutoff.
> Any design that does not include some way of not requesting work from projects that have already used a resource too much will break #9 and #10. Using the projects that have work available for a resource type as the benchmark, we can have this rule: any project with a lower work fetch priority than some project that has work on the system for a particular resource type will not be asked to supply work for that resource, except to fill the queues to min_queue.
>
> Work fetch reduction.
> It would be useful for the work fetch routine to have some information about which projects have applications for which resources. This would limit the number of spurious work requests: CPU work requests to GPU-only projects (GPU Grid, for example), NVIDIA work requests to projects that only have CPU and ATI applications, and GPU work requests of any kind to CPU-only projects, or on Macs when a project has only Windows applications. The list would be fairly short. The XML would look like:
> <suppress_work_fetch><ATI>1</ATI><NVIDIA>1</NVIDIA></suppress_work_fetch>
> for a CPU-only project with support for all of the various OSs. This would allow projects to suppress work fetch requests for resources that are not in their stable of applications.
>
> Having a hysteresis in the work queue will help if there is a single project, but it has less effect as more projects are attached to a host, since the work fetch may well go to a project other than the one about to get a report. Hosts with multiple projects will also tend to make smaller work requests. I am not saying this is enough to throw out the idea. If the queue is full to the maximum (for all instances of all resources), do we stop requesting work, even from otherwise eligible projects? If we do, this limits short term variability. If we don't, the hysteresis becomes either very large indeed or nonexistent, depending on what we do with min_queue. Do we wait for a resource instance to drop below min_queue before requesting work for that resource from any project? The answer to this question interacts with the answer to the previous one in interesting ways. Note that since there can be more than one project, the client should remember which direction it is going: each resource type should have a flag indicating whether it is filling or draining. If the client is attempting to fill the queues from multiple projects, it should keep asking different projects for work until it has reached the maximum queue allowed. If it is working off the current load, it should not ask for work until some instance of the resource has dropped to the minimum. My suggestion: stop asking any project for work once the queue is filled to maximum, until some resource instance has dropped below minimum (see the sketch below). On a dual CPU system, a single CPDN task does not fill both instances of the queue...
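[A minimal sketch of the per-resource direction flag suggested above; names are illustrative, this is not existing client code.]

    // One of these per resource type (CPU, NVIDIA, ATI).
    struct ResourceBuffer {
        enum Mode { FILLING, DRAINING };
        Mode mode = FILLING;
        double min_buf;   // "connect every X", seconds of queued work
        double max_buf;   // min_buf plus the "extra work" setting

        // queued = seconds of work queued for the emptiest instance of
        // this resource type. Returns true if the client should ask some
        // project for more work of this type.
        bool want_work(double queued) {
            if (mode == FILLING && queued >= max_buf) mode = DRAINING;
            else if (mode == DRAINING && queued <= min_buf) mode = FILLING;
            return mode == FILLING;
        }
    };

[The two-threshold flag is what gives the hysteresis: requests stop at max_buf and do not resume until some instance drains to min_buf, regardless of which project the work came from.]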
> Determining the amount of work available for a resource instance.
> Using the round robin scheduler to make this determination has left my CPUs without work several times: tasks with little work left but approaching deadlines were finished early in a disconnected period, leaving the single long running task (typically CPDN) to run on its own later, with idle CPUs on the same machine. An estimate of the least friendly packing of tasks onto CPUs is to assign the tasks with the shortest remaining run times to resources first. This can be done with a single pass through the tasks, as in the sketch below.
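[A sketch of that single-pass pessimistic estimate; illustrative C++, the function and names are assumptions.]

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <vector>

    // Hand out the shortest tasks first, each to the currently
    // least-loaded CPU, then report when the first CPU runs dry. This is
    // the "least friendly" packing: the long task ends up alone while the
    // other CPUs exhaust their short tasks early.
    double earliest_idle_time(std::vector<double> remaining, int ncpus) {
        std::sort(remaining.begin(), remaining.end());   // shortest first
        // min-heap of per-CPU committed runtime, all initially idle
        std::priority_queue<double, std::vector<double>, std::greater<double>>
            cpu_load(std::greater<double>(), std::vector<double>(ncpus, 0.0));
        for (double r : remaining) {
            double least = cpu_load.top();
            cpu_load.pop();
            cpu_load.push(least + r);   // assign to least-loaded CPU
        }
        return cpu_load.top();   // seconds until some CPU first goes idle
    }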
> jm7

David Anderson <[email protected]>
10/28/2010 05:23 PM
To: <[email protected]>
cc: BOINC Developers Mailing List <[email protected]>
Subject: Re: [boinc_dev] proposed scheduling policy changes

John:
I updated the proposal:
http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen
Please review.
-- David

On 28-Oct-2010 11:56 AM, [email protected] wrote:
>> Work Fetch Cutoff.
>>
>> Proposed that:
>>
>> A project is only eligible for a full work fetch if it has a higher work fetch priority than all projects that currently have work on the client (or there is no work on the client at all), AND the project does not have a resource share of 0, AND the host is not running all resources of that type in High Priority.
>>
>> Full work fetch is the maximum of (("extra work" + "connect every X") * resource fraction, sum (estimated resource idle time before "extra work" + "connect every X" from now)) for a single resource type. This would be determined from a least-estimated-runtime-remaining-first pass.
>>
>> If a project is not eligible for full work fetch, or all of a device type is running High Priority, the project would be limited to filling sum (estimated resource idle time before "connect every X" from now) for the resource type. (Fill the queue to the lowest acceptable point from the least ineligible project.)
>>
>> This will limit the ability of projects to get perpetually deeper in debt to the other projects on CPU time. It also sidesteps the issue where a GPU has only one project available - the higher priority projects that have no work available do not count against the work fetch for this project. It also sidesteps the issue where a project is down for a long time - it won't have work to run on the system, and will therefore not count against the other projects filling the queue.
>>
>> jm7
>>
>> David Anderson <[email protected]>
>> 10/27/2010 02:41 PM
>> To: <[email protected]>
>> cc: BOINC Developers Mailing List <[email protected]>
>> Subject: Re: [boinc_dev] proposed scheduling policy changes
>>
>> On 27-Oct-2010 11:38 AM, [email protected] wrote:
>>>
>>> The proposed work fetch policy will also mean that more work will be fetched from the projects where the machine is least effective, assuming FLOP counting is used to grant credits.
>>
>> That's a good point.
>> I'm now leaning towards client-estimated credit rather than actual credit as a basis.
>> (Also because of the issues that Richard raised)
>>
>> -- David