Hopefully we're getting close to a push on per-app-version issues. Another one came up at AQUA this evening.
They have apps with wildly differing runtimes: I switched a host tonight from Fokker-Planck (under 50 minutes) to Adiabatic (preliminary prediction 8 days) - both CPU tasks, no coprocessor apps active at the moment. For research purposes, they like to get their results back quickly, so they make extensive use of <max_wus_in_progress>. But to cover their planned 12-hour server maintenance window this weekend, I would have had to cache 15 or so FP tasks. They can't allow that for the 8-day tasks. So <max_wus_in_progress> is another candidate for migration to per-app-version, please.

----- Original Message -----
From: "David Anderson" <da...@ssl.berkeley.edu>
To: "Richard Haselgrove" <r.haselgr...@btinternet.com>
Cc: <john.mcl...@sybase.com>; <boinc_dev@ssl.berkeley.edu>
Sent: Wednesday, January 06, 2010 7:15 PM
Subject: Re: [boinc_dev] Preemption of very short tasks.

> The "temp DCF" change doesn't address the following problem.
> The current plan is to keep track of per-app-version DCF on the server;
> I hope to get to this in the next couple of months.
> -- David
>
> Richard Haselgrove wrote:
>> There's also a converse problem if a project supplies too-small job FLOP
>> counts, in that EDF may not be invoked soon enough: this particularly
>> applies if a long, under-estimated task follows a succession of shorter
>> and/or better-estimated tasks.
>>
>> We first saw this clearly with the introduction of Astropulse under the
>> s...@home banner. Many people use optimised sah applications: indeed,
>> the stock sah application has incorporated many optimisations over the
>> years, meaning that the sah stock job FLOP counts are routinely too big
>> (typically by a factor of ~5 for modern Core2 CPUs, leading to DCF
>> values of ~0.2). Full CPU optimisation can double the effect, leading
>> to a DCF of ~0.1.
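[Editor's aside: the over-fetch arithmetic behind the DCF values quoted above can be made concrete. This is a minimal sketch, not actual BOINC code; the function name, FLOP counts, and host speed are all invented for illustration.]

```python
# Hypothetical sketch of the client's runtime-estimate arithmetic: a
# server-supplied FLOP count is divided by host speed and scaled by the
# project-wide duration correction factor (DCF). When one app's FLOP
# counts are inflated and another's are accurate, the shared DCF makes
# the accurate app's tasks look far shorter than they are.

def estimated_runtime(rsc_fpops_est, host_flops, dcf):
    """Estimated runtime in seconds: claimed FLOPs / host speed, scaled by DCF."""
    return rsc_fpops_est / host_flops * dcf

HOST_FLOPS = 4e9  # a ~4 GFLOPS Core2-era CPU (illustrative)

# sah-style task: FLOP count inflated ~5x, so completed tasks have
# driven the project DCF down to ~0.2.
dcf = 0.2
sah_est = estimated_runtime(4e13, HOST_FLOPS, dcf)    # ~2000 s, roughly right

# Accurately-counted Astropulse-style task: true runtime ~40 h, but the
# sah-trained DCF scales its estimate down to ~8 h.
ap_true = 5.76e14 / HOST_FLOPS                        # 144000 s = 40 h
ap_est = estimated_runtime(5.76e14, HOST_FLOPS, dcf)  # 28800 s = 8 h

print(ap_true / 3600, ap_est / 3600)  # prints 40.0 8.0: a 5x under-estimate
```

A cache filled on the basis of the 8-hour figure is really a 5x-larger commitment, which is exactly the over-fetch trap described in the next paragraph of the thread.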
>> If a succession of such tasks is followed by an accurately-estimated AP
>> task ("too small", in the context of the over-estimated tasks which
>> preceded it), BOINC will assume that the following task will complete
>> much sooner than will actually be the case. In the case of the initial
>> release of Astropulse (when no comparable optimisations were
>> available), I seem to remember that BOINC would estimate that the
>> tasks would take ~10 hours on a Core2, when in reality they would
>> take ~40 hours.
>>
>> Of course, as soon as an Astropulse task completed, DCF would be reset
>> and new estimates calculated, but by then BOINC could have got itself
>> into serious work over-fetch trouble. A single-project SETI cruncher
>> with a 10-day cache setting (not an unknown animal!), caching AP tasks
>> on the basis of the 10-hour estimate, could find themselves with a
>> 40-day cache as soon as DCF corrected itself, and no way of completing
>> them all within deadline, EDF or not.
>>
>> The '"temp DCF" for the app version' envisioned by changeset 20077
>> will be of some help in this kind of situation, because it should
>> start to inhibit work fetch as soon as a task seems to be outstaying
>> its initial estimate (something I think I've suggested in the past).
>> But it isn't going to work in the sah_enh / AP case (different apps)
>> if the 'temp DCF' is maintained per app version while the permanent
>> DCF is still kept at the project level. The scope of the temp and
>> permanent DCFs has to be the same: ideally both per app version.
>>
>> ----- Original Message -----
>> From: "David Anderson" <da...@ssl.berkeley.edu>
>> To: <john.mcl...@sybase.com>
>> Cc: <boinc_dev@ssl.berkeley.edu>
>> Sent: Wednesday, January 06, 2010 6:25 AM
>> Subject: Re: [boinc_dev] Preemption of very short tasks.
>>
>>> Several recent posts have described the same scenario:
>>> a project supplies too-large job FLOP counts.
>>> Its jobs are projected to miss deadline, and start off in EDF.
>>> As their fraction done increases, their completion estimates improve
>>> and they no longer miss deadline. They're preempted and other jobs
>>> from the project are started. Soon there are lots of partly-finished
>>> jobs.
>>>
>>> I checked in a change that should fix this. The basic idea:
>>> information from running jobs is used to scale the completion
>>> estimates of unstarted jobs.
>>>
>>> -- David
>>>
>>> john.mcl...@sybase.com wrote:
>>>> I am attached to Goldbach's Conjecture, which is running some very
>>>> short tasks (~2 minutes). I have a large number of these that have
>>>> been pre-empted at around 1:55. I believe that what is happening is
>>>> this: Goldbach's is asked for work, and provides some. The tasks
>>>> immediately enter EDF. Since it is a dual-CPU system, two of
>>>> Goldbach's tasks are started at the same time. When one of these two
>>>> finishes, two other tasks are marked as requiring EDF, and the one
>>>> with only seconds remaining is then pre-empted. More tasks for
>>>> Goldbach's are downloaded and run, with some of these also being
>>>> suspended. This is leading to a rather large collection of
>>>> mostly-run tasks that will not be gotten to for a week or more, as
>>>> they only have seconds left and the deadline is much later. The new
>>>> tasks keep the STD low enough that those with very little time left
>>>> are unlikely to complete in normal round robin, but will have to
>>>> wait until the deadline to start their last few seconds (the safety
>>>> margin was removed; even though upload and report are not zero-time,
>>>> they are being treated as such). This is leading to many more tasks
>>>> in the queue than should be there.
>>>>
>>>> There are a couple of solutions:
>>>>
>>>> 1) Treat tasks with the same deadline in lexicographical order -
>>>> even if some of them are marked as EDF and others are not.
>>>> 2) If the rr_sim indicates a potential miss, let the tasks run out
>>>> their current time slice unless a test of EDF completion also
>>>> indicates a potential deadline miss.
>>>>
>>>> Either of these would allow the tasks that are mostly done to
>>>> complete, and to be uploaded and reported, which reduces the risk
>>>> of hitting a major slowdown in the UI because of too many tasks on
>>>> the client.
>>>>
>>>> jm7
>>>>
>>>
>>> _______________________________________________
>>> boinc_dev mailing list
>>> boinc_dev@ssl.berkeley.edu
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
>>>
>>
>>
>
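[Editor's aside: the idea in David's check-in - using information from running jobs to scale the completion estimates of unstarted jobs - can be sketched as follows. This is a hypothetical illustration with invented names, not the actual changeset.]

```python
# Sketch: project an actual total runtime for each running job from its
# elapsed time and fraction done, take the ratio to its original
# estimate, and apply that ratio to unstarted jobs of the same app.
# A uniformly inflated FLOP count then stops triggering spurious EDF.

def est_time_remaining(task):
    """Naive remaining-time estimate: original estimate scaled by work left."""
    return task["orig_estimate"] * (1.0 - task["fraction_done"])

def runtime_scale(running_tasks):
    """Mean ratio of projected actual runtime to original estimate,
    over running jobs that have made meaningful progress."""
    ratios = []
    for t in running_tasks:
        if t["fraction_done"] > 0.1:  # too little progress is noise
            projected_total = t["elapsed"] / t["fraction_done"]
            ratios.append(projected_total / t["orig_estimate"])
    return sum(ratios) / len(ratios) if ratios else 1.0

# A running job claimed 3600 s but is half done after 100 s: the
# projected total is 200 s, i.e. the estimates are ~18x too large.
running = [{"orig_estimate": 3600.0, "elapsed": 100.0, "fraction_done": 0.5}]
scale = runtime_scale(running)

unstarted = {"orig_estimate": 3600.0, "fraction_done": 0.0}
print(est_time_remaining(unstarted) * scale)  # ~200 s instead of 3600 s
```

With the scaled estimate, the rr_sim no longer projects a deadline miss for the queued 2-minute tasks, so they are not forced into EDF and the preemption churn John describes does not start.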