I definitely agree with Paul. I don't see a good reason to test for a CPU reschedule more than once a minute, and 5 to 10 minutes seems more reasonable, the one exception being a resource going idle. In the best case we are wasting cycles that could be used by the science app. In the worst case we kill tasks that have trouble dealing with multiple stops and restarts.
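To make the proposal concrete, here is a minimal sketch of a throttle that lets the reschedule test run at most once per interval, except when a resource goes idle. The class name, the method name, and the 300-second default are all hypothetical, chosen only for illustration; this is not the BOINC client's actual code.

```python
class RescheduleThrottle:
    """Hypothetical gate deciding whether the reschedule test may run now."""

    def __init__(self, min_interval_s=300.0):
        # Assumed floor between tests: 5 minutes, per the suggestion above.
        self.min_interval_s = min_interval_s
        self.last_run = None  # time of the last test; None means never run

    def should_run(self, now, resource_idle=False):
        # An idle resource always triggers the test, regardless of the timer.
        if resource_idle:
            self.last_run = now
            return True
        # Otherwise run only if the minimum interval has elapsed.
        if self.last_run is None or now - self.last_run >= self.min_interval_s:
            self.last_run = now
            return True
        return False
```

With this shape, events such as checkpoints or file downloads can still request a reschedule as often as they like; the request is simply ignored until the interval has passed or something goes idle.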
john

----- Original Message -----
From: "Paul D. Buck" <[email protected]>
To: "Richard Haselgrove" <[email protected]>
Cc: "BOINC Developers Mailing List" <[email protected]>; "John McLeod" <[email protected]>
Sent: Saturday, April 25, 2009 05:20
Subject: Re: [boinc_dev] 6.6.20 and work scheduling

>
> On Apr 25, 2009, at 1:48 AM, Richard Haselgrove wrote:
>
>> Paul D. Buck wrote
>>
>>> Again, the point being that we still have the issue where BOINC
>>> continually changes its mind about what should be running.
>>>
>>> One of the causes is that we call the scheduling routine so
>>> often. But that is not all of it. Even if we do throttle the
>>> number of calls to this routine it will not stop the switching
>>> because the deadline rules are still skee-woof, but it will help
>>> reduce the onset slightly.
>>
>> Again, I disagree.
>>
>> 'Request CPU reschedule' doesn't (shouldn't) automatically mean
>> that a change will happen - I snipped many occurrences from my logs
>> where the result was NOP. 'Request CPU reschedule' calls a TEST to
>> see IF a reschedule is necessary/appropriate. Jiggling with the
>> number of times the test is run won't make the slightest difference
>> if the test itself is flawed.
>
> You are correct, if the test is wrong or the governing conditions are
> wrong the timing of the test will not significantly alter things.
> However, I have watched tasks go in and out of work in seconds to
> minutes and the only thing that really changed was that some of these
> events happened. If we were not triggering the test so often the bad
> tests would not be run so often. Won't stop the bad decisions,
> but ... I am still trying to find someone that can explain why running
> these tests every 60 seconds makes so much sense. And if that makes
> sense, why does running it every 10-20 seconds make more sense...
>
> The point I am stressing is that there is really only one time that
> reschedule MUST in all cases be run ...
> and that is when a resource goes idle. All other tests / cycles are
> mostly a waste of time.
>
>> I was intrigued by the "Idle state changed" case, because I see the
>> effect on my P4/v5.10.13. Not every time, but enough to be
>> noticeable. On this system, I've never noticed a timeslice run short
>> - it's always been at least an hour, and then some.
>>
>> That suggests that on a slower, single-core, machine, 'Request CPU
>> reschedule' happens, if anything, too rarely. If there are no task
>> exits, work fetches, or file downloads, then the TEST doesn't run.
>> And it seems as if "Checkpoint reached" only triggers "Request
>> enforce CPU schedule" - for the current schedule, presumably.
>
> I could wish ...
>
> Even on my slow Xeons I see tasks that never get their full slice ...
> if they did they would never have gone into suspension.
>
> There are two issues here, and they create a "perfect storm" giving
> rise to the effects we see...
>
> a) The tests are run far too often, and there is no floor to the
> number of times per minute this reschedule is run ... and for the life
> of me I can't even see why we are running it more than once every 5
> minutes.
>
> and,
>
> b) the deadline calculations are cranked. For one thing, if you have
> a long TSI as I do, which if honored would not allow partially done
> tasks unless they took more than 12 hours to complete, this biases the
> system so that it considers far too many tasks in deadline trouble.
> There may be other issues with these calculations. The problem is
> that I can see them on the screen but cannot find the "proof" if you
> will in the logs.
>
> Fundamentally I don't think the calculations properly work for "fast"
> and "wide" systems.
>
> If I have a 4 core system and all 4 of the currently running tasks are
> going to complete in the next 6 hours, the tasks that are due in 24
> hours and have 30 minute run times have zero need to run RIGHT NOW ...
> but BOINC running the schedule tests will do just that ... what is
> the point of calculating DCF if we are not going to use it to certify
> the estimates of run times ...
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
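The four-core scenario in the quoted message can be worked through as simple arithmetic. The sketch below is only an illustration of that reasoning, not BOINC's actual deadline logic; the function name, its parameters, and the idea of scaling the run-time estimate by a correction factor (in the spirit of DCF) are assumptions made for the example.

```python
def slack_hours(deadline_h, wait_for_core_h, est_run_time_h, correction=1.0):
    """Hours to spare if the task only starts once a core frees up.

    `correction` is a hypothetical multiplier on the run-time estimate,
    standing in for a DCF-style adjustment; 1.0 means "trust the estimate".
    """
    finish_h = wait_for_core_h + est_run_time_h * correction
    return deadline_h - finish_h


# Scenario from the message: all four cores are busy for the next 6 hours,
# and the waiting task is due in 24 hours with a 30 minute run time.
slack = slack_hours(deadline_h=24.0, wait_for_core_h=6.0, est_run_time_h=0.5)

# With positive slack there is no need to preempt a running task.
needs_to_run_now = slack <= 0
```

Under these numbers the waiting task would still finish 17.5 hours ahead of its deadline even if it did not start until a core came free on its own.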
