The 'projected flops based on PFC avg' reports come from the 'Albert' test server, which has recently been updated (within the last 7 days) to the current BOINC master server code. Germany has a public holiday today, but I'm sure Bernd will help facilitate investigations into this issue when time permits.
>________________________________
>From: David Anderson <[email protected]>
>To: [email protected]
>Sent: Monday, 9 June 2014, 2:54
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>
>
>This is an Einstein-specific issue
>(at least, I can't debug it without looking at their server).
>-- David
>
>On 07-Jun-2014 3:44 AM, Stephen Maclagan wrote:
>> And my HD7770 is getting the following at Albert because it hasn't finished
>> its 11 validations for its app_version yet:
>>
>> 2014-06-05 09:56:29.7913 [PID=7201 ] [version] looking for version of einsteinbinary_BRP4G
>> 2014-06-05 09:56:29.7913 [PID=7201 ] [version] Checking plan class 'BRP4G-opencl-ati'
>> 2014-06-05 09:56:29.7913 [PID=7201 ] [version] plan_class_spec: parsed project prefs setting 'gpu_util_brp' : true : 1.000000
>> 2014-06-05 09:56:29.7913 [PID=7201 ] [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
>> 2014-06-05 09:56:29.7913 [PID=7201 ] [version] Best app version is now AV721 (18620.28 GFLOP)
>> 2014-06-05 09:56:29.7913 [PID=7201 ] [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
>> 2014-06-05 09:56:29.7914 [PID=7201 ] [version] Best version of app einsteinbinary_BRP4G is [AV#721] (34968.78 GFLOPS)
>>
>> 2014-06-05 09:56:29.7974 [PID=7201 ] [send] Sending app_version einsteinbinary_BRP4G 7 134 BRP4G-opencl-ati; projected 34968.78 GFLOPS
>> 2014-06-05 09:56:29.7976 [PID=7201 ] [send] est. duration for WU 606407: unscaled 8.01 scaled 10.96
>> 2014-06-05 09:56:29.7976 [PID=7201 ] [send] [HOST#8143] sending [RESULT#1454943 p2030.20131124.G176.16-01.04.S.b2s0g0.00000_3616_1] (est. dur. 10.96s (0h00m10s95)) (max time 160.14s (0h02m40s14))
>>
>> Real duration is going to be something like an hour, and not the 11 seconds
>> it expects it to be done in!!
>>
>> https://albert.phys.uwm.edu/results.php?hostid=8143&offset=0&show_names=0&state=5&appid=29
>>
>> Claggy
>>
>>
>>
>>> Date: Sat, 7 Jun 2014 10:51:16 +0100
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>>
>>> And bad form, with two separate issues to report. Sorry again.
>>>
>>> 1) Use of outlier detection to avoid skewed averages
>>> 2) Initial runtime estimates on the Android platform
>>>
>>> 1) Outlier detection.
>>>
>>> This arises from the recent introduction of a new app_version at the
>>> LHCclassic project. LHC, by its very nature, is searching for the onset of
>>> chaotic orbital behaviour in the simulated particle beam: they expect, and
>>> actively want, many tasks to finish early.
>>>
>>> Eric Mcintosh commented in a recent 'lessons learned' news item -
>>> http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3838 - that
>>> EXIT_TIME_LIMIT_EXCEEDED was his #1 problem following the new version
>>> release. I've advised accordingly in that thread.
>>>
>>> But I was surprised to find that outlier detection - an appropriate
>>> solution to this particular case - wasn't documented in the developer Wiki:
>>> a trac/wiki search only returns a single hit for 'outlier', and that's in
>>> http://boinc.berkeley.edu/trac/wiki/ServerUpdates - which we seem to have
>>> stopped updating. The one-line summary doesn't give much of a clue about
>>> when and why this feature might be useful, and without a git translation
>>> the SVN reference doesn't help either.
>>>
>>> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=e49f9459080b488f85fbcf8cdad6db9672416cf8
>>>
>>>
>>> 2) Android runtime estimates
>>>
>>> The example here is from SIMAP.
>>> During a recent pause between batches, I
>>> noticed that some of my 'pending validation' tasks were being slow to
>>> clear: http://boincsimap.org/boincsimap/results.php?hostid=349248
>>>
>>> The clearest example is the third of those three workunits:
>>> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>>
>>> Four of the seven replications have failed with 'Error while computing',
>>> and every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android
>>> device.
>>>
>>> Three of the four hosts have never returned a valid result (total credit
>>> zero), so they have never had a chance to establish an APR for use in
>>> runtime estimation: runtime estimates and bounds must have been generated
>>> by the server.
>>>
>>> It seems - from these results, and others I've found pending on other
>>> machines - that SIMAP tasks on Android are aborted with
>>> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch released
>>> today, SIMAP are using a 3x bound (which may be a bit low under the
>>> circumstances):
>>>
>>> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>>
>>> so I deduce that the tasks when first issued had a runtime estimate of ~2
>>> hours.
>>>
>>> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over
>>> half an hour to complete: two hours for an ARM device sounds suspiciously
>>> low. The only one of my Android wingmates to have registered an APR
>>> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is
>>> showing 1.69 GFLOPS, but I have no way of knowing whether that APR was
>>> established before or after the task in question errored out.
>>>
>>> From experience - borne out by current tests at Albert@Home, where server
>>> logs are helpfully exposed to the public - initial server estimates can be
>>> hopelessly over-optimistic.
>>> These two are for the same machine:
>>>
>>> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>>> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
>>>
>>> Since SIMAP have recently announced that they are leaving the BOINC
>>> platform at the end of the year (despite being an Android launch partner
>>> with Samsung), I doubt they'll want to put much effort into researching
>>> this issue.
>>>
>>> But if other projects experimenting with Android applications are
>>> experiencing a high task failure rate, they might like to check whether
>>> EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if
>>> so, consider the other remediation approaches (apart from outliers, which
>>> isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected]
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
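
For anyone wanting to sanity-check the SIMAP and Albert figures quoted in this thread, a minimal Python sketch follows. It assumes the simplest possible model, in which the initial runtime estimate is rsc_fpops_est divided by the projected flops and the abort limit is rsc_fpops_bound divided by the same figure. The real scheduler applies further scaling (note the 'unscaled 8.01 scaled 10.96' log line), so treat the results as approximate. The function names are hypothetical and are not part of BOINC.

```python
# Simplified model (an assumption, not the BOINC scheduler's actual code):
#   estimated runtime = rsc_fpops_est   / projected_flops
#   abort limit       = rsc_fpops_bound / projected_flops


def runtime_seconds(fpops: float, projected_flops: float) -> float:
    """Seconds needed to perform `fpops` operations at `projected_flops`."""
    return fpops / projected_flops


def implied_flops(fpops_bound: float, abort_after_seconds: float) -> float:
    """Projected speed implied by an observed EXIT_TIME_LIMIT_EXCEEDED time."""
    return fpops_bound / abort_after_seconds


# SIMAP workunit values quoted in the thread.
RSC_FPOPS_EST = 13.5e12
RSC_FPOPS_BOUND = 40.5e12

# Ratio between the bound and the estimate (the "3x bound").
bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST

# Tasks were aborted after roughly 6 hours, which implies this projected speed.
android_flops = implied_flops(RSC_FPOPS_BOUND, 6 * 3600)

# Initial runtime estimate, in hours, at that implied speed.
initial_estimate_hours = runtime_seconds(RSC_FPOPS_EST, android_flops) / 3600

# Same workunit on the i5 host with APR 7.34 GFLOPS, in minutes.
i5_minutes = runtime_seconds(RSC_FPOPS_EST, 7.34e9) / 60

# Albert log lines for AV#716: PFC-based figure versus elapsed-time figure.
pfc_vs_elapsed = 2124.60 / 23.71

print(f"bound ratio:               {bound_ratio:.1f}x")
print(f"implied Android speed:     {android_flops / 1e9:.3f} GFLOPS")
print(f"implied initial estimate:  {initial_estimate_hours:.1f} h")
print(f"i5 runtime at 7.34 GFLOPS: {i5_minutes:.1f} min")
print(f"PFC avg / elapsed avg:     {pfc_vs_elapsed:.1f}x")
```

Under these assumptions the ~6 hour abort corresponds to a projected speed of a little under 2 GFLOPS and an initial estimate of about 2 hours, while the i5 at APR 7.34 GFLOPS comes out at roughly half an hour. The PFC-based figure for AV#716 is close to 90 times the elapsed-time figure for the same host.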
