http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>________________________________
> From: Josef W. Segur <[email protected]>
>To: David Anderson <[email protected]>
>Cc: "[email protected]" <[email protected]>; Eric J Korpela
><[email protected]>; Richard Haselgrove <[email protected]>
>Sent: Tuesday, 10 June 2014, 2:19
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>
>
>Consider Richard's observation:
>
>>> It appears that the Android Whetstone benchmark used in the BOINC client has
>>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>>> will report that it is significantly faster than a plain-vanilla ARM.
>
>If that is so, it distinctly differs from the x86 Whetstone which never uses
>SIMD, and is truly conservative as you would want for 3).
>--
> Joe
>
>
>
>On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
>
>> Eric:
>>
>> Yes, I suspect that's what's going on.
>> Currently the logic for estimating job runtime
>> (estimate_flops() in sched_version.cpp) is
>> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
>> 2) if this app version has > 100 results, use app version statistics
>> 3) else use a conservative estimate based on p_fpops.
>>
>> I'm not sure we should be doing 2) at all,
>> since as you point out the first 100 or 1000 results for an app version
>> will generally be from the fastest devices
>> (and even in the steady state,
>> app version statistics disproportionately reflect fast devices).
>>
>> I'll make this change.
>>
>> -- David
>>
>> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>>> I also don't have direct access to the server, so I'm mostly guessing.
>>> Having separate benchmarks for neon and VFP means there's a broad bimodal
>>> distribution for the benchmark results. Where the mean falls depends upon the mix
>>> of machines.
>>> In general the neon machines (being newer and faster) will report
>>> first and more often, so early on the PFC distribution will reflect the fast
>>> machines. Slower machines will be underweighted. So the work will be estimated to
>>> complete quickly, and some machines will time out. In SETI beta, it resolves itself
>>> in a few weeks. I can't guarantee that it will anywhere else.
>>>
>>> We see this with every release of a GPU app. The real capabilities of graphics
>>> cards vary by orders of magnitude from the estimate and by more from each other.
>>> The fast cards report first and most everyone else hits days of timeouts.
>>>
>>> One possible fix is to increase the timeout limits for the first 10 workunits for a
>>> host_app_version, until host based estimates take over.
>>>
>>>
>>>
>>>
>>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> I think Eric Korpela would be the best person to answer that question, but I
>>> suspect 'probably not': further investigation over the weekend suggests that the
>>> circumstances may be SIMAP-specific.
>>>
>>> It appears that the Android Whetstone benchmark used in the BOINC client has
>>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>>> will report that it is significantly faster than a plain-vanilla ARM.
>>>
>>> However, SIMAP have only deployed a single Android app, which I'm assuming only
>>> uses ARM functions: devices with vfp or NEON SIMD vectorisation available would
>>> run the non-optimised application much slower than BOINC expects.
>>>
>>> At my suggestion, Thomas Rattei (SIMAP administrator) increased the
>>> rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
>>> runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
>>> already increased from 11 hours to 14 hours since he did that.
>>>
>>> Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
>>> due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
>>> using their phone regularly for computing." He's going to monitor how this
>>> progresses during the remainder of the current batch, and I've asked him to keep
>>> us updated on his observations.
>>>
>>>
>>>
>>> >________________________________
>>> > From: David Anderson <[email protected] <mailto:[email protected]>>
>>> >To: [email protected] <mailto:[email protected]>
>>> >Sent: Monday, 9 June 2014, 3:48
>>> >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>> >
>>> >
>>> >Does this problem occur on SETI@home?
>>> >-- David
>>> >
>>> >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>>> >
>>> >> 2) Android runtime estimates
>>> >>
>>> >> The example here is from SIMAP. During a recent pause between batches, I noticed
>>> >> that some of my 'pending validation' tasks were being slow to clear:
>>> >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>>> >>
>>> >> The clearest example is the third of those three workunits:
>>> >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>> >>
>>> >> Four of the seven replications have failed with 'Error while computing', and
>>> >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>>> >>
>>> >> Three of the four hosts have never returned a valid result (total credit zero),
>>> >> so they have never had a chance to establish an APR for use in runtime
>>> >> estimation: runtime estimates and bounds must have been generated by the server.
>>> >>
>>> >> It seems - from these results, and others I've found pending on other machines -
>>> >> that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after ~6
>>> >> hours elapsed.
>>> >> For the new batch released today, SIMAP are using a 3x bound
>>> >> (which may be a bit low under the circumstances):
>>> >>
>>> >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>> >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>> >>
>>> >> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
>>> >>
>>> >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over half
>>> >> an hour to complete: two hours for an ARM device sounds suspiciously low. The
>>> >> only one of my Android wingmates to have registered an APR
>>> >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is showing
>>> >> 1.69 GFLOPS, but I have no way of knowing whether that APR was established before
>>> >> or after the task in question errored out.
>>> >>
>>> >> From experience - borne out by current tests at Albert@Home, where server logs
>>> >> are helpfully exposed to the public - initial server estimates can be hopelessly
>>> >> over-optimistic. These two are for the same machine:
>>> >>
>>> >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301)
>>> >> adjusting projected flops based on PFC avg: 2124.60G
>>> >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301)
>>> >> setting projected flops based on host elapsed time avg: 23.71G
>>> >>
>>> >> Since SIMAP have recently announced that they are leaving the BOINC platform at
>>> >> the end of the year (despite being an Android launch partner with Samsung), I
>>> >> doubt they'll want to put much effort into researching this issue.
>>> >>
>>> >> But if other projects experimenting with Android applications are experiencing a
>>> >> high task failure rate, they might like to check whether EXIT_TIME_LIMIT_EXCEEDED
>>> >> is a significant factor in those failures, and if so, consider the other
>>> >> remediation approaches (apart from outliers, which isn't relevant in this case)
>>> >> that I suggested to Eric Mcintosh at LHC.
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
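
For anyone following the numbers in this thread, here is a minimal Python sketch of the three-tier selection David describes, together with the arithmetic behind Richard's "~2 hours" deduction. It is not the code in sched_version.cpp. The Stats class, the function names, the use_app_version_stats switch and the 0.5 GFLOPS benchmark figure are all hypothetical. It also assumes that the runtime estimate and the time limit are simply rsc_fpops_est and rsc_fpops_bound divided by the projected speed. The 1.875 GFLOPS app version average is just the value implied by a 2 hour estimate for 13.5e12 operations, not a measured figure.

```python
from dataclasses import dataclass

MIN_HOST_RESULTS = 10      # threshold quoted for tier 1
MIN_VERSION_RESULTS = 100  # threshold quoted for tier 2


@dataclass
class Stats:
    """Hypothetical container: result count and the average FLOPS it implies."""
    n_results: int
    avg_flops: float


def projected_flops(host_version: Stats, app_version: Stats, p_fpops: float,
                    use_app_version_stats: bool = True) -> float:
    """Choose a speed estimate using the three tiers described in the thread."""
    if host_version.n_results > MIN_HOST_RESULTS:
        return host_version.avg_flops   # 1) (host, app version) statistics
    if use_app_version_stats and app_version.n_results > MIN_VERSION_RESULTS:
        return app_version.avg_flops    # 2) app version statistics
    return p_fpops                      # 3) fallback based on the benchmark


def hours(fpops: float, flops: float) -> float:
    """Time in hours to perform fpops operations at flops operations/second."""
    return fpops / flops / 3600.0


# Workunit values quoted by Richard (a 3x bound).
RSC_FPOPS_EST = 13.5e12
RSC_FPOPS_BOUND = 40.5e12

# A new, slow Android host with no results of its own yet (assumed figures).
new_host = Stats(n_results=0, avg_flops=0.0)
fast_fleet = Stats(n_results=500, avg_flops=1.875e9)  # dominated by fast devices
slow_benchmark = 0.5e9                                 # hypothetical p_fpops

for tier2 in (True, False):
    speed = projected_flops(new_host, fast_fleet, slow_benchmark, tier2)
    print(f"tier 2 enabled={tier2}: "
          f"estimate {hours(RSC_FPOPS_EST, speed):.1f} h, "
          f"limit {hours(RSC_FPOPS_BOUND, speed):.1f} h")
# tier 2 enabled=True: estimate 2.0 h, limit 6.0 h
# tier 2 enabled=False: estimate 7.5 h, limit 22.5 h
```

Under these assumptions, a slow device inherits the fast fleet's speed through tier 2 and gets the ~6 hour cut-off Richard observed. Skipping tier 2, or raising the bound multiplier, gives it more headroom until it has more than 10 results of its own.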
