Re: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate limiting from server

Richard Haselgrove Wed, 12 Feb 2014 10:02:48 -0800

This, on the other hand, is a case which should have been prevented by the old 
(existing) code: work fetch initiated between app exit and start of upload. It 
would be better to fix that, rather than introduce a whole new case handler.


12/02/2014 17:52:28 |  | [work_fetch] Request work fetch: application exited
12/02/2014 17:52:28 | Einstein@Home | Computation for task 
p2030.20131012.G181.70-02.47.S.b0s0g0.00000_3526_0 finished
12/02/2014 17:52:28 | Einstein@Home | Starting task 
p2030.20131012.G181.56-02.69.S.b6s0g0.00000_2733_1 using einsteinbinary_BRP4 
version 134 (opencl-intel_gpu) in slot 0
12/02/2014 17:52:29 | Einstein@Home | [work_fetch] set_request() for intel_gpu: 
ninst 1 nused_total 35.000000 nidle_now 0.000000 fetch share 1.000000 req_inst 
0.000000 req_secs 3467.114311
12/02/2014 17:52:29 | Einstein@Home | [sched_op] Starting scheduler request
12/02/2014 17:52:29 | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 
inst) NVIDIA (0.00 sec, 0.00 inst) intel_gpu (3467.11 sec, 0.00 inst)
12/02/2014 17:52:29 | Einstein@Home | Sending scheduler request: To fetch work.
12/02/2014 17:52:29 | Einstein@Home | Reporting 4 completed tasks
12/02/2014 17:52:29 | Einstein@Home | Requesting new tasks for intel_gpu
12/02/2014 17:52:30 | Einstein@Home | Started upload of 
p2030.20131012.G181.70-02.47.S.b0s0g0.00000_3526_0_0
12/02/2014 17:52:31 | Einstein@Home | Finished upload of 
p2030.20131012.G181.70-02.47.S.b0s0g0.00000_3526_0_0
12/02/2014 17:52:31 |  | [work_fetch] Request work fetch: project finished 
uploading
12/02/2014 17:52:36 | Einstein@Home | Scheduler request completed: got 5 new 
tasks
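
To make the gap in the log above concrete, here is a minimal sketch. It is not the BOINC client code: the Task class and both helper names are hypothetical, invented purely for illustration. A check that only looks at transfers already under way sees nothing at 17:52:29, whereas a check based on task state (finished, output not yet uploaded) would have held the request back.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical, simplified view of a task; not a BOINC structure."""
    name: str
    compute_done: bool = False
    upload_started: bool = False
    upload_done: bool = False


def upload_in_progress(tasks):
    """True only if some transfer has actually started and not yet finished."""
    return any(t.upload_started and not t.upload_done for t in tasks)


def upload_pending(tasks):
    """True if any finished task still has output to upload, started or not."""
    return any(t.compute_done and not t.upload_done for t in tasks)


# State at 17:52:29 in the log: the app has exited, but the upload
# does not start until 17:52:30.
tasks = [Task("p2030.20131012.G181.70-02.47.S.b0s0g0.00000_3526_0",
              compute_done=True)]

gap_missed = not upload_in_progress(tasks)  # transfer-only check lets the fetch through
gap_caught = upload_pending(tasks)          # state-based check would defer the fetch
```

Whether the real client can cheaply tell "finished, upload not yet started" apart from "nothing to upload" is an assumption here; the sketch only shows why the ordering in the log slips past a transfer-based test.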



>________________________________
> From: Richard Haselgrove <r.haselgr...@btopenworld.com>
>To: David Anderson <da...@ssl.berkeley.edu>; BOINC Developers Mailing List 
><boinc_dev@ssl.berkeley.edu> 
>Sent: Tuesday, 11 February 2014, 14:41
>Subject: Re: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate 
>limiting from server
> 
>
>
>Here's an example of the sort of event which can cause the problems Rytis was 
>describing:
>
>
>11/02/2014 14:28:15 | boincsimap | [sched_op] Starting scheduler request
>11/02/2014 14:28:15 | boincsimap | Sending scheduler request: To fetch work.
>11/02/2014 14:28:15 | boincsimap | Requesting new tasks for CPU
>11/02/2014 14:28:15 | boincsimap | [sched_op] CPU work request: 1.46 seconds; 
>0.00 devices
>11/02/2014 14:28:15 | boincsimap | [sched_op] NVIDIA work request: 0.00 
>seconds; 0.00 devices
>11/02/2014 14:28:17 | boincsimap | Scheduler request completed: got 1 new tasks
>11/02/2014 14:28:17 | boincsimap | [sched_op] Server version 703
>11/02/2014 14:28:17 | boincsimap | Project requested delay of 7 seconds
>11/02/2014 14:28:17 | boincsimap | [sched_op] estimated total CPU task 
>duration: 3680 seconds
>11/02/2014 14:28:17 | boincsimap | [sched_op] estimated total NVIDIA task 
>duration: 0 seconds
>11/02/2014 14:28:17 | boincsimap | [sched_op] Deferring communication for 
>00:00:07
>11/02/2014 14:28:17 | boincsimap | [sched_op] Reason: requested by project
>11/02/2014 14:28:19 | boincsimap | Started download of 20140129.556477
>11/02/2014 14:28:24 | boincsimap | Finished download of 20140129.556477
>11/02/2014 14:28:47 | boincsimap | Computation for task 20140129.537727_1 
>finished
>11/02/2014 14:28:47 | boincsimap | Starting task 20140129.540879_1
>11/02/2014 14:28:47 | boincsimap | [cpu_sched] Starting task 20140129.540879_1 
>using simap version 512 in slot 1
>11/02/2014 14:28:49 | boincsimap | Started upload of 20140129.537727_1_0
>11/02/2014 14:29:00 | boincsimap | Finished upload of 20140129.537727_1_0
>
>
>But because work was requested 30 seconds *before* a task completed, neither 
>the old nor the new versions of "inhibit RPCs during upload" would have 
>prevented it.
>
>
>As it happens, SIMAP is one of the projects which could honestly use the 
>"estimates are linear and can be trusted" flag, if available.
>
>
>
>>________________________________
>> From: Richard Haselgrove <r.haselgr...@btopenworld.com>
>>To: David Anderson <da...@ssl.berkeley.edu>; BOINC Developers Mailing List 
>><boinc_dev@ssl.berkeley.edu> 
>>Sent: Saturday, 8 February 2014, 12:08
>>Subject: Re: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate 
>>limiting from server
>> 
>>
>>
>>I thought we had this protection in place already?
>>
>>
>>Specifically, since your checkin 60fc3d3 of April 2011:
>>
>>
>>"client: defer reporting completed tasks if an upload started recently;
>>we might be able to report more tasks once the upload completes."
>>
>>
>>http://boinc.berkeley.edu/trac/changeset/60fc3d3f22f66d7a7b5bb5632d2de322cf2f180a/boinc-v2
>>
>>
>>
>>If that works (and in my experience it does), it exactly covers Rytis' 
>>problem: by delaying work fetch until the previous task is reportable, an 
>>extra slot is made available within the jobs-in-progress limit.
>>
>>
>>It took a few follow-up revisions to get 60fc3d3 working properly: the only 
>>remaining loophole that I can see is that occasionally BOINC might slip in a 
>>work fetch after a task has exited, but before the upload has even started. 
>>The other situation which could lead to Rytis' observation is if BOINC 
>>requested new work shortly before his task exited, but we have always 
>>resisted the calls to adjust scheduling on the basis of anticipated/estimated 
>>completion times.
>>
>>
>>I'm a little worried by the new checkin: if a project completes tasks, and 
>>hence starts uploads, more frequently than once every five minutes, will it 
>>ever break free of the deferral?
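
The worry above can be sketched as a small simulation. It rests on an assumption that the thread does not confirm: that the five-minute window restarts with every new upload start. The function names and the 10-second polling step are hypothetical; this is not the checked-in code.

```python
DEFER_SECS = 300  # the five-minute window from the checkin description


def fetch_allowed(now, upload_starts, defer_secs=DEFER_SECS):
    """Assumed rule: no work fetch within defer_secs of any past upload start."""
    return all(now - t >= defer_secs for t in upload_starts if t <= now)


def first_allowed_time(upload_interval, horizon, step=10):
    """Return the first time (in seconds) at which fetch is allowed, or None."""
    upload_starts = list(range(0, horizon + 1, upload_interval))
    for now in range(0, horizon + 1, step):
        if fetch_allowed(now, upload_starts):
            return now
    return None


# Uploads start every 4 minutes: the window is always renewed before it lapses.
every_4_min = first_allowed_time(upload_interval=240, horizon=3600)

# Uploads start every 10 minutes: the window lapses after 300 seconds.
every_10_min = first_allowed_time(upload_interval=600, horizon=3600)
```

Under that assumption a project whose uploads start every four minutes never gets a fetch within the simulated hour, while one whose uploads start every ten minutes is free again after 300 seconds. If the real deferral is anchored differently, the outcome changes.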
>>
>>
>>>________________________________
>>> From: David Anderson <da...@ssl.berkeley.edu>
>>>To: BOINC Developers Mailing List <boinc_dev@ssl.berkeley.edu> 
>>>Sent: Saturday, 8 February 2014, 0:00
>>>Subject: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate 
>>>limiting from server
>>> 
>>>
>>>I checked in the following change to address the problem
>>>Rytis describes below.
>>>
>>>    client: work fetch policy tweak
>>>
>>>    If a project has active uploads, defer work fetch from it for 5 minutes
>>>    even if there are idle devices (that's the change).
>>>    This addresses a situation (reported by Rytis) where
>>>    - a project P has a jobs-in-progress limit less than NCPUS
>>>    - P's jobs finish and are uploading
>>>    - the client asks P for work and doesn't get any because of the limit
>>>    - the client does exponential backoff from P
>>>    Over the long term, P can get much less than its fair share of work
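
The situation in the list above can be sketched in a few lines. Everything here is an illustrative assumption rather than scheduler or client code: the function names, the example limit of 2, and the backoff bounds of 60 and 86400 seconds are all hypothetical.

```python
def tasks_granted(limit, in_progress, requested):
    """Assumed server rule: never exceed `limit` tasks in progress per host."""
    return max(0, min(requested, limit - in_progress))


def next_backoff(current, minimum=60, maximum=86400):
    """Illustrative exponential backoff: double the delay within assumed bounds."""
    return min(maximum, max(minimum, current * 2))


limit = 2  # hypothetical jobs-in-progress limit, less than NCPUS

# Both jobs have finished computing but are still uploading, so they are
# unreported and still count against the limit.
granted_early = tasks_granted(limit, in_progress=2, requested=2)

# Once the uploads finish, the same request can report both jobs,
# which frees the slots.
granted_late = tasks_granted(limit, in_progress=0, requested=2)

# Three empty replies in a row push the next request further and further out.
backoff = 0
for _ in range(3):
    backoff = next_backoff(backoff)
```

The early request gets nothing and the late one gets two jobs, which is why holding the request until the uploads are done avoids the backoff in the first place.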
>>>
>>>-- David
>>>
>>>-------- Original Message --------
>>>Subject:     Scheduler troubles in conjunction with rate limiting from server
>>>Date:     Fri, 7 Feb 2014 12:41:04 +0200
>>>From:     Rytis Slatkevičius <ry...@gridrepublic.org>
>>>To:     David Anderson <da...@ssl.berkeley.edu>
>>>CC:     Matthew Blumberg <m...@gridrepublic.org>
>>>
>>>
>>>
>>>Hello David,
>>>
>>>we observed an interesting problem with task scheduling:
>>>
>>>Project A (our project) limits number of tasks per proc to 2 and has 
>>>resource share
>>>of 500;
>>>Project B (Einstein) does not limit number of tasks and has resource share 
>>>of 25.
>>>
>>>B has longer tasks than A, and also longer tasks than the minimum work 
>>>buffer.
>>>
>>>When attaching both, A has priority because of resource share. It fetches 2 
>>>tasks
>>>(as the server does not send any more). B then fetches tasks to fill the 
>>>remaining
>>>buffers up to the minimum threshold.
>>>
>>>When A finishes work, scheduler request happens as there is not enough work
>>>available to fill all work slots. However, because the completed tasks have 
>>>not been
>>>uploaded yet, scheduler does not send any new work as it is limited to 2 
>>>tasks on
>>>host (and it still has them, even though computation is complete). Backoff 
>>>happens
>>>for A as no work is provided, and therefore B is asked for work. Now only B is running.
>>>
>>>When B finishes work, either A is asked again (if the backoff has 
>>>completed), two
>>>tasks are sent, and process repeats again, or A is not even asked (if the 
>>>backoff is
>>>still in progress) and B is asked again.
>>>
>>>The end result: system runs work from B almost exclusively, even though A 
>>>has work
>>>available (BOINC just thinks it does not). We increased the job limits to a 
>>>number
>>>higher than the minimum threshold and the issue seems to have disappeared.
>>>
>>>--
>>>Pagarbiai / Sincerely
>>>Rytis Slatkevičius
>>>+370 670 77777
>>>
>>>
>>>_______________________________________________
>>>boinc_dev mailing list
>>>boinc_dev@ssl.berkeley.edu
>>>http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>>To unsubscribe, visit the above URL and
>>>(near bottom of page) enter your email address.
>>>
>>>
>>
>>
>
>
