I changed the design to avoid the use of project-supplied credit: http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen
-- David

On 27-Oct-2010 6:16 AM, [email protected] wrote:
> I see a major flaw with using RAC. Suppose we have a project (say CPDN)
> that takes several months on a particular computer, granting credit all
> the way, and running constantly at high priority. (Yes, I know, a
> somewhat slow computer, but they still exist.) At the end of that time
> the RAC for CPDN is well established, but then it starts to decay, and it
> will not be long before a few tasks from another project (say Spinhenge,
> with < 1/2-day tasks) attached to the same computer are completed and
> validated. This will generate a spike in RAC for Spinhenge, and another
> CPDN task will be downloaded. The immediate conclusion is that the
> half-life of the RAC used for long-term scheduling has to be much longer
> than the length of the longest task on a particular computer for it to
> make any sense at all.
>
> Let's say the CPDN RAC at the end of that task is 100, and the RAC for
> Spinhenge is 0. At the end of a week of running Spinhenge only, the RAC
> for Spinhenge should be approaching 100 and the RAC for CPDN is 50...
>
> Using server-side data requires an update to fetch the data.
> Unfortunately, a project that has a high reported RAC at a client is
> unlikely to be contacted for any reason. It is entirely possible that a
> situation like having the validators offline for a week could permanently
> turn off the project once they come back online. A computer reports a few
> tasks, and is told that the RAC right now is 100,000 because a week's
> worth of work has just been validated in the last minute. This pushes the
> project to the bottom of the list for contact on that particular host.
> Since the RAC reported from the server never changes until the server is
> contacted again to report or fetch work, this host may never get around
> to contacting that project again. The data must be calculated locally
> from the best information available at the time.
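[Editor's note: the decay argument above can be checked numerically. A minimal sketch of a BOINC-style exponentially decayed average; the one-week half-life and the function names are assumptions chosen to match the numbers in the mail, not the actual client code.]

```cpp
#include <cassert>
#include <cmath>

// Exponentially decayed average, RAC-style: after dt seconds the old
// average keeps weight 2^(-dt / HALF_LIFE); credit earned at a steady
// rate contributes the remaining weight.
const double HALF_LIFE = 7.0 * 86400.0;  // assumed: one week, in seconds

double decayed_avg(double old_avg, double steady_rate, double dt) {
    double w = std::exp2(-dt / HALF_LIFE);   // weight kept by the old average
    return old_avg * w + steady_rate * (1.0 - w);
}
```

Under these assumptions, a project whose RAC was 100 and then earns nothing for one week drops to exactly 50, while a project earning a steady 100 credits/day climbs from 0 to 50 over the same week — the crossover the mail describes.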
>
> Another major flaw is that RAC is much too slow for use as a scheduler.
> It will run only one project for a long time, then only another project
> for a long time. It will not switch on anything like an hourly basis.
>
> What about machines that contact the servers only once a week or so? The
> data on the host is going to be quite stale by the end of the week.
>
> So a counter-proposal:
>
> 1) Use an STD per device type for short-term scheduling. Not perfect
> maybe, but the short-term scheduler needs to be responsive to local data
> only, as it cannot count on feedback from the servers. RAF does not work
> well, as once the work is downloaded, it is already set for a specific
> device type.
>
> 2) Instead of Recent Average Credit, use Average Credit. Write some data
> into the client_state.xml file recording the time and the host credit at
> the time of the first install of the version that uses this scheduler;
> on attach of a new project, or on a reset of a project, write the current
> time and the current credit as reported by the server as initial
> conditions. At the time that work is fetched, use (current credit -
> initial credit) / (now - initial time) + C * RAF as the criterion for
> where to try to fetch work from. Note that backoff will eventually allow
> projects other than the top one to fetch work. Note that C will need to
> be negative, because if it is positive, projects that have just completed
> work will have a high RAF and will be first in line to get more. The
> long-term credit average needs to be a major component; I would propose
> that the two terms weigh about half each.
>
> 3) This will require a change to the policy of how much work to fetch
> from any project, and overall. The current LTD method leaves some fairly
> good methods for determining a choke number. I am not certain that the
> proposed scheme does so.
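[Editor's note: a minimal sketch of the criterion in point 2 above. The struct and field names are hypothetical; the one point the code encodes is that C must be negative, so a project that just fetched work is demoted.]

```cpp
#include <cassert>

// Hypothetical per-project state for the counter-proposal's criterion:
// (current credit - initial credit) / (now - initial time) + C * RAF.
struct ProjState {
    double initial_credit, current_credit;
    double initial_time, now;   // seconds
    double raf;                 // recent average fetch, decayed like RAC
};

// C must be negative so recent fetches lower a project's priority.
double fetch_priority(const ProjState& p, double C) {
    double long_term_rate = (p.current_credit - p.initial_credit)
                          / (p.now - p.initial_time);
    return long_term_rate + C * p.raf;
}
```

With C = -0.5, two projects with identical long-term credit rates sort so that the one that fetched work most recently (higher RAF) comes last, which is the desired rotation.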
> The client should neither fetch all of the work from a single project,
> nor should it allow work fetch from a project that consistently runs
> high priority and has used more than its share of resource time.
>
> One final note:
>
> There will be no way at all to balance some resource share allocations
> across a single platform. Suppose that there are 3 projects attached to
> a computer, all with equal resource shares. The GPU runs 10x as fast as
> the CPU; one project's tasks will run on CPU or GPU, and the other two
> will run on CPU only. The GPU/CPU project will never run on the CPU
> (this is OK), and it will have a much higher average credit and RAF than
> the two CPU projects. Yet the project that can run on the GPU cannot be
> choked off from GPU work fetch, as it is the only project that can run
> on the GPU. This would be made substantially easier if the client knew
> which device types each project could supply work for. The proposal is
> that the project provide a list of supported device types on every
> update. The client could then incorporate this into the decision of
> where to fetch work from. When building a work fetch for the GPU in this
> case, it would scan the list of projects and compare only those that it
> knew could support the GPU to determine work fetch for the GPU. The
> single project in this case that supported the GPU would then be
> eligible for a full work fetch of min_queue + extra_work, instead of
> just min_queue (because it has used, and will always use, too much of
> the resources of the computer, given the wide variation in the abilities
> of the devices).
>
> Counter-Proposal 2:
>
> Give up on treating all the devices on the host as a single entity.
> Treat each different type of device as a separate computer for the
> purposes of work fetch. This may not be what the end users want, though.
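[Editor's note: the device-type filter proposed above could look something like this sketch. The type names, the string device labels, and the project names are invented for illustration; the priority metric is left abstract.]

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Each project advertises, on every scheduler update, which device
// types it can supply work for; the client filters on that list when
// building a fetch request for a particular device.
struct Proj {
    std::string name;
    std::set<std::string> device_types;  // e.g. {"cpu"} or {"cpu", "gpu"}
    double priority;                     // from whatever long-term metric is used
};

// Pick the highest-priority project that can feed the given device.
const Proj* pick_for_device(const std::vector<Proj>& projects,
                            const std::string& device) {
    const Proj* best = nullptr;
    for (const Proj& p : projects) {
        if (p.device_types.count(device) == 0) continue;  // can't feed it
        if (best == nullptr || p.priority > best->priority) best = &p;
    }
    return best;
}
```

In the three-project example from the mail, the lone GPU-capable project is selected for GPU fetch no matter how low its priority is — which is exactly why it cannot be choked off from GPU work.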
>
> jm7
>
> David Anderson <[email protected]>
> Sent by: BOINC Developers Mailing List <[email protected]>
> To: [email protected]
> Date: 10/26/2010 05:13 PM
> Subject: [boinc_dev] proposed scheduling policy changes
>
> Experiments with the client simulator using Richard's scenario
> made it clear that the current scheduling framework
> (based on STD and LTD for separate processor types) is fatally flawed:
> it may divide resources among projects in a way that makes no sense
> and doesn't respect resource shares.
>
> In particular, resource shares, as some have already pointed out,
> should apply to total work (as measured by credit)
> rather than to individual processor types.
> If two projects have equal resource shares,
> they should ideally have equal RAC,
> even if that means that one of them gets 100% of a particular processor
> type.
>
> I think it's possible to do this,
> although there are difficulties due to delayed credit granting.
> I wrote up a design for this:
> http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen
> Comments are welcome.
>
> BTW, the new mechanisms would be significantly simpler than the old ones.
> This is always a good sign.
>
> -- David
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
