Hopefully we're getting close to a push on per-app-version issues.

Another one came up at AQUA this evening.

They have apps with wildly differing runtimes: I switched a host tonight 
from Fokker-Planck (under 50 minutes) to Adiabatic (preliminary prediction 8 
days) - both CPU tasks, no coprocessor apps active at the moment.

For research purposes, they like to get their results back quickly, so they
make extensive use of <max_wus_in_progress>. But to cover their planned 12-hour
server maintenance window this weekend, I would have had to cache 15 or so 
FP tasks. They can't allow that for the 8-day tasks.

So <max_wus_in_progress> is another candidate for migration to 
per-app-version, please.
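To illustrate the request, here is a minimal sketch of what a per-app-version
limit check might look like at scheduling time. All names here
(can_send_task, the app-version keys, the limit values) are illustrative
assumptions, not the actual BOINC server code or config schema:

```python
# Hypothetical sketch: enforcing <max_wus_in_progress> per app version
# rather than per project. Names and structure are illustrative only.

def can_send_task(in_progress_counts, app_version, limits):
    """Return True if the host may receive another task of this app version."""
    limit = limits.get(app_version)      # per-app-version cap, if configured
    if limit is None:
        return True                      # no cap for this app version
    return in_progress_counts.get(app_version, 0) < limit

# A project could then cap the 8-day tasks tightly while still letting
# hosts cache enough of the sub-hour tasks to ride out an outage:
limits = {"fokker_planck": 15, "adiabatic": 2}
counts = {"fokker_planck": 10, "adiabatic": 2}
print(can_send_task(counts, "fokker_planck", limits))  # True: 10 < 15
print(can_send_task(counts, "adiabatic", limits))      # False: at cap
```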


----- Original Message ----- 
From: "David Anderson" <da...@ssl.berkeley.edu>
To: "Richard Haselgrove" <r.haselgr...@btinternet.com>
Cc: <john.mcl...@sybase.com>; <boinc_dev@ssl.berkeley.edu>
Sent: Wednesday, January 06, 2010 7:15 PM
Subject: Re: [boinc_dev] Preemption of very short tasks.


> The "temp DCF" change doesn't address the following problem.
> The current plan is to keep track of per-app-version DCF on the server;
> I hope to get to this in the next couple of months.
> -- David
>
> Richard Haselgrove wrote:
>> There's also a converse problem if a project supplies too-small job FLOP 
>> counts, in that EDF may not be invoked soon enough: this particularly 
>> applies if a long, under-estimated task follows a succession of shorter 
>> and/or better estimated tasks.
>>
>> We first saw this clearly with the introduction of Astropulse under the 
>> s...@home banner. Many people use optimised sah applications: indeed, the 
>> stock sah application has incorporated many optimisations over the years, 
>> meaning that the sah stock job FLOP counts are routinely too big 
>> (typically by a factor of ~5 for modern Core2 CPUs, leading to DCF values 
>> of ~0.2). Full CPU optimisation can double the effect, leading to a DCF 
>> of ~0.1. If a succession of such tasks is followed by an 
>> accurately-estimated AP task ("too small", in the context of the 
>> over-estimated tasks which preceded it), BOINC will assume that the 
>> following task will complete much sooner than will be the case in 
>> reality. In the case of the initial release of Astropulse (when no 
>> comparable optimisations were available), I seem to remember that BOINC 
>> would form an estimate that the tasks would take ~10 hours on a Core2, 
>> when in reality they would take ~40 hours.
>>
>> Of course, as soon as an Astropulse task completed, DCF would be reset 
>> and new estimates calculated, but by then BOINC could have got itself 
>> into serious work over-fetch trouble. A single-project SETI cruncher with 
>> a 10-day cache setting (not an unknown animal!), caching AP tasks on the 
>> basis of the 10-hour estimate, could find themselves with a 40-day cache 
>> as soon as DCF corrected itself, and no way of completing them all within 
>> deadline, EDF or not.
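The arithmetic of that over-fetch scenario can be checked directly. This is
just a back-of-envelope sketch of the numbers quoted above (10-day cache,
10-hour estimate, 40-hour actual runtime), not anything from the client code:

```python
# Back-of-envelope check of the SETI/Astropulse over-fetch scenario.
cache_days = 10        # user's cache setting
est_hours = 10.0       # client's per-task estimate (DCF ~0.1 applied)
actual_hours = 40.0    # real runtime on that CPU

# Tasks fetched to fill the cache at the optimistic estimate:
tasks_fetched = cache_days * 24 / est_hours        # 24 tasks

# Real work on hand once DCF corrects itself:
real_work_days = tasks_fetched * actual_hours / 24  # 40 days

print(tasks_fetched, real_work_days)
```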
>>
>> The '"temp DCF" for the app version' envisioned by changeset 20077 will 
>> be of some help in this kind of situation, because it should start to 
>> inhibit work fetch as soon as a task seems to be outstaying the initial 
>> estimate (something I think I've suggested in the past). But it isn't 
>> going to work in the sah_enh / AP case (different apps) if the 'temp DCF' 
>> is maintained by app_version, while the permanent DCF is still kept at 
>> the project level. The scope for the temp and permanent DCFs has to be 
>> the same: ideally both per app_version.
>>
>> ----- Original Message ----- From: "David Anderson" 
>> <da...@ssl.berkeley.edu>
>> To: <john.mcl...@sybase.com>
>> Cc: <boinc_dev@ssl.berkeley.edu>
>> Sent: Wednesday, January 06, 2010 6:25 AM
>> Subject: Re: [boinc_dev] Preemption of very short tasks.
>>
>>
>>> Several recent posts have described the same scenario:
>>> a project supplies too-large job FLOP counts.
>>> Its jobs are projected to miss deadline, and start off in EDF.
>>> As their fraction done increases, their completion
>>> estimates improve and they no longer miss deadline.
>>> They're preempted and other jobs from the project are started.
>>> Soon there are lots of partly-finished jobs.
>>>
>>> I checked in a change that should fix this.
>>> The basic idea: information from running jobs is used to scale the
>>> completion estimates of unstarted jobs.
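As I read the idea described above, a running job's elapsed time and fraction
done give a correction ratio that can be applied to unstarted jobs of the same
app version. The following is a simplified model of that scaling, with
illustrative function names, not the actual changeset 20077 code:

```python
# Simplified model of a "temp DCF": use the observed progress of a
# running job to rescale the completion estimates of unstarted jobs.
# Names and structure are illustrative, not the actual BOINC client code.

def temp_dcf(elapsed, fraction_done, original_estimate):
    """Ratio of projected actual runtime to the original estimate."""
    projected_runtime = elapsed / fraction_done
    return projected_runtime / original_estimate

def scaled_estimate(unstarted_est, elapsed, fraction_done, running_est):
    """Scale an unstarted job's estimate by the running job's temp DCF."""
    return unstarted_est * temp_dcf(elapsed, fraction_done, running_est)

# A task estimated at 2h has run 3h and is only half done, so the
# temp DCF is 3; another 2h-estimated task is rescaled to 6h:
print(scaled_estimate(7200, 10800, 0.5, 7200))  # 21600.0 seconds
```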
>>>
>>> -- David
>>>
>>> john.mcl...@sybase.com wrote:
>>>> I am attached to Goldbach's Conjecture, which is running some very short
>>>> tasks (~2 minutes).  I have a large number of these that have been
>>>> pre-empted at around 1:55.  I believe that what is happening is that
>>>> Goldbach's is asked for work, and provides some.  The tasks immediately
>>>> enter EDF.  Since it is a dual CPU system, 2 of Goldbach's tasks are
>>>> started at the same time.  When one of these two finishes, two other
>>>> tasks are marked as requiring EDF and the one with only seconds 
>>>> remaining
>>>> is then pre-empted.  More tasks for Goldbach's are downloaded, and run 
>>>> with
>>>> some of these also being suspended.  This is leading to a rather large
>>>> collection of mostly run tasks that will not be gotten to for a week or 
>>>> so
>>>> more as they only have seconds left and the deadline is much later. 
>>>> The
>>>> new tasks keep the STD low enough so that these that have very little 
>>>> time
>>>> left are unlikely to complete in normal Round Robin, but will have to 
>>>> wait
>>>> until the deadline to start the last few seconds (the safety margin was
>>>> removed, so even though upload and report take non-zero time, they are
>>>> being treated as such).  This is leading to many more tasks in the queue than
>>>> should be there.
>>>>
>>>> There are a couple of solutions:
>>>>
>>>> 1)  Treat tasks with the same deadline in lexicographical order - even 
>>>> if
>>>> some of them are marked as EDF and others are not.
>>>> 2)  If the rr_sim indicates a potential miss, let the tasks run out 
>>>> their
>>>> current time slice unless a test of EDF completion also indicates a
>>>> potential deadline miss.
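Solution 2 above amounts to a two-stage test before preempting a running
task. A minimal sketch of that decision, with illustrative names rather than
the actual client's scheduler code:

```python
# Sketch of solution 2: when round-robin simulation (rr_sim) predicts a
# deadline miss, preempt a running task mid-timeslice only if an EDF
# simulation ALSO predicts a miss; otherwise let it run out its slice.
# Names are illustrative, not actual BOINC client code.

def preempt_running_task(rr_sim_predicts_miss, edf_sim_predicts_miss):
    """Preempt only when even EDF scheduling cannot meet all deadlines."""
    return rr_sim_predicts_miss and edf_sim_predicts_miss

# rr_sim flags a miss, but EDF could still meet all deadlines: the task
# with only seconds left keeps running and completes.
print(preempt_running_task(True, False))  # False: finish the time slice
print(preempt_running_task(True, True))   # True: genuine deadline pressure
```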
>>>>
>>>> Either of these would allow the tasks that are mostly done to complete
>>>> and be uploaded and reported, which reduces the risk of hitting a major
>>>> slowdown in the UI because of too many tasks on the client.
>>>>
>>>> jm7
>>>>
>>>
>>> _______________________________________________
>>> boinc_dev mailing list
>>> boinc_dev@ssl.berkeley.edu
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
>>>
>>
>>
>
> 


