A possibility that will work itself out eventually: if a machine has exceeded its computation bound on every task so far for a given application version / plan class, then its next task for that application version / plan class gets its computation bound multiplied by 2, and so on for each following task until the machine succeeds in actually completing one. That completed task would then be used as the baseline average computation length for that application version / plan class.
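
A minimal Python sketch of that idea, assuming the doubling compounds with each consecutive failure (the message does not say whether it does). `bound_multiplier` and its argument are hypothetical names for illustration, not BOINC scheduler code:

```python
def bound_multiplier(outcomes):
    """Return the factor to apply to the computation bound of the next task.

    outcomes: history for one (host, application version / plan class) pair,
    one boolean per finished task: True if the task completed, False if it
    exceeded its computation bound.
    """
    # Once any task has completed, the host has a baseline of its own,
    # so the bound is no longer inflated.
    if any(outcomes):
        return 1
    # Assumption: every task so far failed, and each failure doubles the bound.
    return 2 ** len(outcomes)


# A host that has timed out on its first three tasks gets 8x the bound next.
print(bound_multiplier([False, False, False]))
```

With this scheme a host that is far slower than estimated still converges after a handful of failed tasks, at the cost of those tasks being wasted.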
-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of Josef W. Segur
Sent: Wednesday, June 11, 2014 11:38 AM
To: David Anderson
Cc: [email protected]; Richard Haselgrove
Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

A couple of ideas specifically related to EXIT_TIME_LIMIT_EXCEEDED:

1. The BOINC x86 benchmark based on the Whetstone test has typically yielded a value which is roughly 90% of the CPU clock rate. In terms of a conservative estimate to be used prior to having usable statistics about how the CPU performs on a specific app version, it may be sensible to use clock rate rather than p_fpops. That would not take into account the possibility that hardware or OS may change the clock rate, but neither does the benchmark.

2. When the scheduler assigns a task to a host, it could multiply the rsc_fpops_bound by 5 or 10 if the host does not yet have sufficient results for the app version.
--
Joe

On Tue, 10 Jun 2014 14:34:57 -0400, David Anderson <[email protected]> wrote:

> For credit purposes, the standard is peak FLOPS,
> i.e. we give credit for what the device could do,
> rather than what it actually did.
> Among other things, this encourages projects to develop more efficient apps.
>
> Currently we're not measuring this well for x86 CPUs,
> since our Whetstone benchmark isn't optimized.
> Ideally the BOINC client should include variants for the most common
> CPU features, as we do for ARM.
>
> -- D
>
> On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
>> Before anybody leaps into making any changes on the basis of that observation, I
>> think we ought to pause and consider why we have a benchmark, and what we use it for.
>>
>> I'd suggest that in an ideal world, we would be measuring the actual running speed
>> of (each project's) science applications on that particular host, optimisations and
>> all. We gradually do this through the runtime averages anyway, but it's hard to
>> gather a priori data on a new host.
>>
>> Instead of (initially) measuring science application performance, we measure
>> hardware performance as a surrogate. We now have (at least) three ways of doing that:
>>
>> x86: minimum, most conservative, estimate, no optimisations allowed for.
>> Android: allows for optimised hardware pathways with vfp or neon, but doesn't relate
>> back to science app capability.
>> GPU: maximum theoretical 'peak flops', calculated from card parameters, then scaled
>> back by rule of thumb.
>>
>> Maybe we should standardise on just one standard?
>>
>>
>> ------------------------------------------------------------------------------------
>> *From:* Richard Haselgrove <[email protected]>
>> *To:* Josef W. Segur <[email protected]>; David Anderson <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>
>> *Sent:* Tuesday, 10 June 2014, 9:37
>> *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>
>> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>>
>> >________________________________
>> > From: Josef W. Segur <[email protected]>
>> >To: David Anderson <[email protected]>
>> >Cc: "[email protected]" <[email protected]>; Eric J Korpela <[email protected]>; Richard Haselgrove <[email protected]>
>> >Sent: Tuesday, 10 June 2014, 2:19
>> >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>> >
>> >Consider Richard's observation:
>> >
>> >>> It appears that the Android Whetstone benchmark used in the BOINC client has
>> >>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>> >>> will report that it is significantly faster than a plain-vanilla ARM.
>> >
>> >If that is so, it distinctly differs from the x86 Whetstone which never uses SIMD, and is truly conservative as you would want for 3).
>> >--
>> > Joe
>> >
>> >On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
>> >
>> >> Eric:
>> >>
>> >> Yes, I suspect that's what's going on.
>> >> Currently the logic for estimating job runtime
>> >> (estimate_flops() in sched_version.cpp) is
>> >> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
>> >> 2) if this app version has > 100 results, use app version statistics
>> >> 3) else use a conservative estimate based on p_fpops.
>> >>
>> >> I'm not sure we should be doing 2) at all,
>> >> since as you point out the first 100 or 1000 results for an app version
>> >> will generally be from the fastest devices
>> >> (and even in the steady state,
>> >> app version statistics disproportionately reflect fast devices).
>> >>
>> >> I'll make this change.
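
The three-step fallback David describes above can be sketched as follows. This is a Python illustration only, not the code in sched_version.cpp: the parameter names and the `use_av_stats` switch (which models the proposal to drop step 2) are made up for this sketch, and step 3 simply returns p_fpops unchanged because the thread does not say how the conservative estimate is derived from it.

```python
def estimate_flops(hav_results, hav_flops, av_results, av_flops, p_fpops,
                   use_av_stats=True):
    """Pick the FLOPS figure used to estimate a job's runtime.

    hav_*: result count and measured speed for this (host, app version).
    av_*:  result count and measured speed for the app version across all hosts.
    p_fpops: the host's Whetstone benchmark figure.
    """
    # 1) Enough results from this host: trust its own statistics.
    if hav_results > 10:
        return hav_flops
    # 2) Enough results project-wide: use app version statistics.
    #    These tend to be dominated by fast devices, hence the switch.
    if use_av_stats and av_results > 100:
        return av_flops
    # 3) Otherwise fall back to a benchmark-based estimate (unscaled here).
    return p_fpops


# A new, slow host on a popular app version inherits the fast hosts' figure...
print(estimate_flops(0, 0.0, 500, 9e9, 1e9))
# ...unless step 2 is disabled, in which case the benchmark is used.
print(estimate_flops(0, 0.0, 500, 9e9, 1e9, use_av_stats=False))
```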
>> >>
>> >> -- David
>> >>
>> >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>> >>> I don't have direct access to the server either, so I'm mostly guessing.
>> >>> Having separate benchmarks for neon and VFP means there's a broad bimodal
>> >>> distribution for the benchmark results. Where the mean falls depends upon the mix
>> >>> of machines. In general the neon machines (being newer and faster) will report
>> >>> first and more often, so early on the PFC distribution will reflect the fast
>> >>> machines. Slower machines will be underweighted. So the work will be estimated to
>> >>> complete quickly, and some machines will time out. In SETI beta, it resolves itself
>> >>> in a few weeks. I can't guarantee that it will anywhere else.
>> >>>
>> >>> We see this with every release of a GPU app. The real capabilities of graphics
>> >>> cards vary by orders of magnitude from the estimate and by more from each other.
>> >>> The fast cards report first and most everyone else hits days of timeouts.
>> >>>
>> >>> One possible fix is to increase the timeout limits for the first 10 workunits for a
>> >>> host_app_version, until host based estimates take over.
>> >>>
>> >>>
>> >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
>> >>>
>> >>> I think Eric Korpela would be the best person to answer that question, but I
>> >>> suspect 'probably not': further investigation over the weekend suggests that the
>> >>> circumstances may be SIMAP-specific.
>> >>>
>> >>> It appears that the Android Whetstone benchmark used in the BOINC client has
>> >>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>> >>> will report that it is significantly faster than a plain-vanilla ARM.
>> >>>
>> >>> However, SIMAP have only deployed a single Android app, which I'm assuming only
>> >>> uses ARM functions: devices with vfp or NEON SIMD vectorisation available would
>> >>> run the non-optimised application much slower than BOINC expects.
>> >>>
>> >>> At my suggestion, Thomas Rattei (SIMAP administrator) increased the
>> >>> rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
>> >>> runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
>> >>> already increased from 11 hours to 14 hours since he did that.
>> >>>
>> >>> Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
>> >>> due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
>> >>> using their phone regularly for computing." He's going to monitor how this
>> >>> progresses during the remainder of the current batch, and I've asked him to keep
>> >>> us updated on his observations.
>> >>>
>> >>>
>> >>> >________________________________
>> >>> > From: David Anderson <[email protected]>
>> >>> >To: [email protected]
>> >>> >Sent: Monday, 9 June 2014, 3:48
>> >>> >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>> >>> >
>> >>> >
>> >>> >Does this problem occur on SETI@home?
>> >>> >-- David
>> >>> >
>> >>> >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>> >>> >
>> >>> >> 2) Android runtime estimates
>> >>> >>
>> >>> >> The example here is from SIMAP. During a recent pause between batches, I noticed
>> >>> >> that some of my 'pending validation' tasks were being slow to clear:
>> >>> >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>> >>> >>
>> >>> >> The clearest example is the third of those three workunits:
>> >>> >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>> >>> >>
>> >>> >> Four of the seven replications have failed with 'Error while computing', and
>> >>> >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>> >>> >>
>> >>> >> Three of the four hosts have never returned a valid result (total credit zero),
>> >>> >> so they have never had a chance to establish an APR for use in runtime
>> >>> >> estimation: runtime estimates and bounds must have been generated by the server.
>> >>> >>
>> >>> >> It seems - from these results, and others I've found pending on other machines -
>> >>> >> that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after ~6
>> >>> >> hours elapsed. For the new batch released today, SIMAP are using a 3x bound
>> >>> >> (which may be a bit low under the circumstances):
>> >>> >>
>> >>> >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>> >>> >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>> >>> >>
>> >>> >> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
>> >>> >>
>> >>> >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over half
>> >>> >> an hour to complete: two hours for an ARM device sounds suspiciously low. The
>> >>> >> only one of my Android wingmates to have registered an APR
>> >>> >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is showing
>> >>> >> 1.69 GFLOPS, but I have no way of knowing whether that APR was established before
>> >>> >> or after the task in question errored out.
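
Richard's deduction above can be checked with a few lines of Python. The sketch assumes that the runtime estimate is rsc_fpops_est divided by the projected speed and that the abort limit is rsc_fpops_bound divided by the same speed; the variable names are chosen for this example, and the 6-hour figure is the approximate value quoted in the message.

```python
# Values quoted from the SIMAP workunit.
rsc_fpops_est = 13_500_000_000_000.0
rsc_fpops_bound = 40_500_000_000_000.0

# Ratio of bound to estimate: the "3x bound".
bound_ratio = rsc_fpops_bound / rsc_fpops_est

# Tasks were aborted after roughly 6 hours, so the initial estimate
# was that limit divided by the bound ratio.
limit_hours = 6.0
estimate_hours = limit_hours / bound_ratio

# Speed the server must have assumed for the Android hosts.
implied_flops = rsc_fpops_est / (estimate_hours * 3600)

# Expected runtime on the i5 with an APR of 7.34 GFLOPS, in minutes.
i5_minutes = rsc_fpops_est / 7.34e9 / 60

print(bound_ratio, estimate_hours, implied_flops, i5_minutes)
```

Under those assumptions the server was projecting close to 1.9 GFLOPS for the Android devices, slightly above the 1.69 GFLOPS APR of the one wingmate that has registered one, while the i5 figure comes out at about 31 minutes, in line with "over half an hour".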
>> >>> >>
>> >>> >> From experience - borne out by current tests at Albert@Home, where server logs
>> >>> >> are helpfully exposed to the public - initial server estimates can be hopelessly
>> >>> >> over-optimistic. These two are for the same machine:
>> >>> >>
>> >>> >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>> >>> >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
>> >>> >>
>> >>> >> Since SIMAP have recently announced that they are leaving the BOINC platform at
>> >>> >> the end of the year (despite being an Android launch partner with Samsung), I
>> >>> >> doubt they'll want to put much effort into researching this issue.
>> >>> >>
>> >>> >> But if other projects experimenting with Android applications are experiencing a
>> >>> >> high task failure rate, they might like to check whether EXIT_TIME_LIMIT_EXCEEDED
>> >>> >> is a significant factor in those failures, and if so, consider the other
>> >>> >> remediation approaches (apart from outliers, which isn't relevant in this case)
>> >>> >> that I suggested to Eric Mcintosh at LHC.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
