This, on the other hand, is a case which should have been prevented by the old (existing) code: work fetch initiated between app exit and start of upload. It would be better to fix that, rather than introduce a whole new case handler.
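To make that gap concrete, here is a rough Python sketch. It is not BOINC client code: the state names and both functions are hypothetical, and the first function is only a simplified model of a check that looks at uploads in progress. It shows why such a check misses a task that has exited but has not yet started uploading, and how treating that state as upload-pending would cover it without a separate case handler.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TaskState(Enum):
    # Hypothetical states, for illustration only.
    RUNNING = auto()
    EXITED_UPLOAD_PENDING = auto()  # app has exited, upload not started yet
    UPLOADING = auto()
    UPLOADED = auto()               # output uploaded, task is reportable


@dataclass
class Task:
    name: str
    state: TaskState


def defer_upload_active_only(tasks):
    """Simplified model: defer work fetch only while an upload is in progress."""
    return any(t.state is TaskState.UPLOADING for t in tasks)


def defer_including_exit_gap(tasks):
    """Also defer in the window between app exit and start of upload."""
    pending = (TaskState.EXITED_UPLOAD_PENDING, TaskState.UPLOADING)
    return any(t.state in pending for t in tasks)


if __name__ == "__main__":
    tasks = [
        Task("just_exited", TaskState.EXITED_UPLOAD_PENDING),
        Task("next_in_slot", TaskState.RUNNING),
    ]
    print("upload-active-only check defers:", defer_upload_active_only(tasks))
    print("check including exit gap defers:", defer_including_exit_gap(tasks))
```

With one task in the exited-but-not-yet-uploading state, the first check allows a fetch and the second defers it. The log below shows the case in question.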

12/02/2014 17:52:28 | | [work_fetch] Request work fetch: application exited
12/02/2014 17:52:28 | Einstein@Home | Computation for task p2030.20131012.G181.70-02.47.S.b0s0g0.00000_3526_0 finished
12/02/2014 17:52:28 | Einstein@Home | Starting task p2030.20131012.G181.56-02.69.S.b6s0g0.00000_2733_1 using einsteinbinary_BRP4 version 134 (opencl-intel_gpu) in slot 0
12/02/2014 17:52:29 | Einstein@Home | [work_fetch] set_request() for intel_gpu: ninst 1 nused_total 35.000000 nidle_now 0.000000 fetch share 1.000000 req_inst 0.000000 req_secs 3467.114311
12/02/2014 17:52:29 | Einstein@Home | [sched_op] Starting scheduler request
12/02/2014 17:52:29 | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (0.00 sec, 0.00 inst) intel_gpu (3467.11 sec, 0.00 inst)
12/02/2014 17:52:29 | Einstein@Home | Sending scheduler request: To fetch work.
12/02/2014 17:52:29 | Einstein@Home | Reporting 4 completed tasks
12/02/2014 17:52:29 | Einstein@Home | Requesting new tasks for intel_gpu
12/02/2014 17:52:30 | Einstein@Home | Started upload of p2030.20131012.G181.70-02.47.S.b0s0g0.00000_3526_0_0
12/02/2014 17:52:31 | Einstein@Home | Finished upload of p2030.20131012.G181.70-02.47.S.b0s0g0.00000_3526_0_0
12/02/2014 17:52:31 | | [work_fetch] Request work fetch: project finished uploading
12/02/2014 17:52:36 | Einstein@Home | Scheduler request completed: got 5 new tasks

>________________________________
> From: Richard Haselgrove <r.haselgr...@btopenworld.com>
>To: David Anderson <da...@ssl.berkeley.edu>; BOINC Developers Mailing List <boinc_dev@ssl.berkeley.edu>
>Sent: Tuesday, 11 February 2014, 14:41
>Subject: Re: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate limiting from server
>
>Here's an example of the sort of event which can cause the problems Rytis was describing:
>
>11/02/2014 14:28:15 | boincsimap | [sched_op] Starting scheduler request
>11/02/2014 14:28:15 | boincsimap | Sending scheduler request: To fetch work.
>11/02/2014 14:28:15 | boincsimap | Requesting new tasks for CPU
>11/02/2014 14:28:15 | boincsimap | [sched_op] CPU work request: 1.46 seconds; 0.00 devices
>11/02/2014 14:28:15 | boincsimap | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
>11/02/2014 14:28:17 | boincsimap | Scheduler request completed: got 1 new tasks
>11/02/2014 14:28:17 | boincsimap | [sched_op] Server version 703
>11/02/2014 14:28:17 | boincsimap | Project requested delay of 7 seconds
>11/02/2014 14:28:17 | boincsimap | [sched_op] estimated total CPU task duration: 3680 seconds
>11/02/2014 14:28:17 | boincsimap | [sched_op] estimated total NVIDIA task duration: 0 seconds
>11/02/2014 14:28:17 | boincsimap | [sched_op] Deferring communication for 00:00:07
>11/02/2014 14:28:17 | boincsimap | [sched_op] Reason: requested by project
>11/02/2014 14:28:19 | boincsimap | Started download of 20140129.556477
>11/02/2014 14:28:24 | boincsimap | Finished download of 20140129.556477
>11/02/2014 14:28:47 | boincsimap | Computation for task 20140129.537727_1 finished
>11/02/2014 14:28:47 | boincsimap | Starting task 20140129.540879_1
>11/02/2014 14:28:47 | boincsimap | [cpu_sched] Starting task 20140129.540879_1 using simap version 512 in slot 1
>11/02/2014 14:28:49 | boincsimap | Started upload of 20140129.537727_1_0
>11/02/2014 14:29:00 | boincsimap | Finished upload of 20140129.537727_1_0
>
>But because work was requested 30 seconds *before* a task completed, neither the old nor the new versions of "inhibit RPCs during upload" would have prevented it.
>
>As it happens, SIMAP is one of the projects which could honestly use the "estimates are linear and can be trusted" flag, if available.
>
>>________________________________
>> From: Richard Haselgrove <r.haselgr...@btopenworld.com>
>>To: David Anderson <da...@ssl.berkeley.edu>; BOINC Developers Mailing List <boinc_dev@ssl.berkeley.edu>
>>Sent: Saturday, 8 February 2014, 12:08
>>Subject: Re: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate limiting from server
>>
>>I thought we had this protection in place already?
>>
>>Specifically, since your checkin 60fc3d3 of April 2011:
>>
>>"client: defer reporting completed tasks if an upload started recently; we might be able to report more tasks once the upload completes."
>>
>>http://boinc.berkeley.edu/trac/changeset/60fc3d3f22f66d7a7b5bb5632d2de322cf2f180a/boinc-v2
>>
>>If that works (and in my experience it does), it exactly covers Rytis' problem: by delaying work fetch until the previous task is reportable, an extra slot is made available within the jobs-in-progress limit.
>>
>>It took a few follow-up revisions to get 60fc3d3 working properly: the only remaining loophole that I can see is that occasionally BOINC might slip in a work fetch after a task has exited, but before the upload has even started. The other situation which could lead to Rytis' observation is if BOINC requested new work shortly before his task exited, but we have always resisted the calls to adjust scheduling on the basis of anticipated/estimated completion times.
>>
>>I'm a little worried by the new checkin: if a project completes tasks, and hence starts uploads, more frequently than once every five minutes, will it ever break free of the deferral?
>>
>>>________________________________
>>> From: David Anderson <da...@ssl.berkeley.edu>
>>>To: BOINC Developers Mailing List <boinc_dev@ssl.berkeley.edu>
>>>Sent: Saturday, 8 February 2014, 0:00
>>>Subject: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate limiting from server
>>>
>>>I checked in the following change to address the problem Rytis describes below.
>>>
>>> client: work fetch policy tweak
>>>
>>> If a project has active uploads, defer work fetch from it for 5 minutes
>>> even if there are idle devices (that's the change).
>>> This addresses a situation (reported by Rytis) where
>>> - a project P has a jobs-in-progress limit less than NCPUS
>>> - P's jobs finish and are uploading
>>> - the client asks P for work and doesn't get any because of the limit
>>> - the client does exponential backoff from P
>>> Over the long term, P can get much less than its fair share of work
>>>
>>>-- David
>>>
>>>-------- Original Message --------
>>>Subject: Scheduler troubles in conjunction with rate limiting from server
>>>Date: Fri, 7 Feb 2014 12:41:04 +0200
>>>From: Rytis Slatkevičius <ry...@gridrepublic.org>
>>>To: David Anderson <da...@ssl.berkeley.edu>
>>>CC: Matthew Blumberg <m...@gridrepublic.org>
>>>
>>>Hello David,
>>>
>>>we observed an interesting trouble with task scheduling:
>>>
>>>Project A (our project) limits number of tasks per proc to 2 and has resource share of 500;
>>>Project B (Einstein) does not limit number of tasks and has resource share of 25.
>>>
>>>B has longer tasks than A, and also longer tasks than the minimum work buffer.
>>>
>>>When attaching both, A has priority because of resource share. It fetches 2 tasks (as the server does not send any more). B then fetches tasks to fill the remaining buffers up to the minimum threshold.
>>>
>>>When A finishes work, scheduler request happens as there is not enough work available to fill all work slots.
>>>However, because the completed tasks have not been uploaded yet, scheduler does not send any new work as it is limited to 2 tasks on host (and it still has them, even though computation is complete). Backoff happens for A as no work is provided, and therefore B is asked for work. Now only B is running.
>>>
>>>When B finishes work, either A is asked again (if the backoff has completed), two tasks are sent, and process repeats again, or A is not even asked (if the backoff is still in progress) and B is asked again.
>>>
>>>The end result: system runs work from B almost exclusively, even though A has work available (BOINC just thinks it does not). We increased the job limits to a number higher than the minimum threshold and the issue seems to have disappeared.
>>>
>>>--
>>>Pagarbiai / Sincerely
>>>Rytis Slatkevičius
>>>+370 670 77777
>>>
>>>_______________________________________________
>>>boinc_dev mailing list
>>>boinc_dev@ssl.berkeley.edu
>>>http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>>To unsubscribe, visit the above URL and
>>>(near bottom of page) enter your email address.