And my HD7770 is getting the following at Albert, because it hasn't finished its 11 validations for its app_version yet:

2014-06-05 09:56:29.7913 [PID=7201 ] [version] looking for version of einsteinbinary_BRP4G
2014-06-05 09:56:29.7913 [PID=7201 ] [version] Checking plan class 'BRP4G-opencl-ati'
2014-06-05 09:56:29.7913 [PID=7201 ] [version] plan_class_spec: parsed project prefs setting 'gpu_util_brp' : true : 1.000000
2014-06-05 09:56:29.7913 [PID=7201 ] [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
2014-06-05 09:56:29.7913 [PID=7201 ] [version] Best app version is now AV721 (18620.28 GFLOP)
2014-06-05 09:56:29.7913 [PID=7201 ] [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
2014-06-05 09:56:29.7914 [PID=7201 ] [version] Best version of app einsteinbinary_BRP4G is [AV#721] (34968.78 GFLOPS)
2014-06-05 09:56:29.7974 [PID=7201 ] [send] Sending app_version einsteinbinary_BRP4G 7 134 BRP4G-opencl-ati; projected 34968.78 GFLOPS
2014-06-05 09:56:29.7976 [PID=7201 ] [send] est. duration for WU 606407: unscaled 8.01 scaled 10.96
2014-06-05 09:56:29.7976 [PID=7201 ] [send] [HOST#8143] sending [RESULT#1454943 p2030.20131124.G176.16-01.04.S.b2s0g0.00000_3616_1] (est. dur. 10.96s (0h00m10s95)) (max time 160.14s (0h02m40s14))
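As a rough sanity check on the figures in this log, here is a minimal sketch. It assumes the scheduler's duration estimate is simply the task's floating-point operation count divided by the projected speed, which is an assumption and not something the log states. The function names and the one-hour runtime are illustrative; the other constants are copied from the log lines.

```python
# Figures copied from the scheduler log above.
PROJECTED_GFLOPS = 34968.78  # "projected 34968.78 GFLOPS"
EST_UNSCALED_S = 8.01        # "est. duration ... unscaled 8.01"
MAX_TIME_S = 160.14          # "max time 160.14s"

# Assumption for illustration only: a real runtime of roughly one hour.
ASSUMED_REAL_RUNTIME_S = 3600.0


def implied_fpops(seconds: float, gflops: float) -> float:
    """Operation count implied by a duration at a given speed (assumed model)."""
    return seconds * gflops * 1e9


def implied_gflops(fpops: float, seconds: float) -> float:
    """Effective speed implied by an operation count and a runtime."""
    return fpops / seconds / 1e9


# Operation count the server appears to be working from.
fpops_est = implied_fpops(EST_UNSCALED_S, PROJECTED_GFLOPS)
# How much headroom the time limit gives over the unscaled estimate.
bound_factor = MAX_TIME_S / EST_UNSCALED_S
# Effective speed if the task really takes the assumed one hour.
effective_gflops = implied_gflops(fpops_est, ASSUMED_REAL_RUNTIME_S)
# How far the assumed one-hour run overshoots the time limit.
overrun = ASSUMED_REAL_RUNTIME_S / MAX_TIME_S

print(f"implied fpops estimate: {fpops_est:.3e}")
print(f"max time / unscaled estimate: {bound_factor:.1f}x")
print(f"effective speed for a one-hour run: {effective_gflops:.1f} GFLOPS")
print(f"one hour exceeds the max time by: {overrun:.1f}x")
```

Under these assumptions, the time limit is about 20 times the unscaled estimate, and a one-hour run would overshoot that limit by a factor of roughly 22.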
The real duration is going to be something like an hour, not the 11 seconds the server expects it to be done in!

https://albert.phys.uwm.edu/results.php?hostid=8143&offset=0&show_names=0&state=5&appid=29

Claggy

> Date: Sat, 7 Jun 2014 10:51:16 +0100
> From: [email protected]
> To: [email protected]
> Subject: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but
> please read)
>
> And bad form, with two separate issues to report. Sorry again.
>
> 1) Use of outlier detection to avoid skewed averages
> 2) Initial runtime estimates on the Android platform
>
> 1) Outlier detection.
>
> This arises from the recent introduction of a new app_version at the
> LHCclassic project. LHC, by its very nature, is searching for the onset of
> chaotic orbital behaviour in the simulated particle beam: they expect, and
> actively want, many tasks to finish early.
>
> Eric Mcintosh commented in a recent 'lessons learned' news item -
> http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3838 - that
> EXIT_TIME_LIMIT_EXCEEDED was his #1 problem following the new version
> release. I've advised accordingly in that thread.
>
> But I was surprised to find that outlier detection - an appropriate solution
> to this particular case - wasn't documented in the developer Wiki: a
> trac/wiki search only returns a single hit for 'outlier', and that's in
> http://boinc.berkeley.edu/trac/wiki/ServerUpdates - which we seem to have
> stopped updating. The one-line summary doesn't give much of a clue about when
> and why this feature might be useful, and without a git translation the SVN
> reference doesn't help either.
>
> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=e49f9459080b488f85fbcf8cdad6db9672416cf8
>
>
> 2) Android runtime estimates
>
> The example here is from SIMAP.
> During a recent pause between batches, I
> noticed that some of my 'pending validation' tasks were being slow to clear:
> http://boincsimap.org/boincsimap/results.php?hostid=349248
>
> The clearest example is the third of those three workunits:
> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>
> Four of the seven replications have failed with 'Error while computing', and
> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>
> Three of the four hosts have never returned a valid result (total credit
> zero), so they have never had a chance to establish an APR for use in runtime
> estimation: runtime estimates and bounds must have been generated by the
> server.
>
> It seems - from these results, and others I've found pending on other
> machines - that SIMAP tasks on Android are aborted with
> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch released
> today, SIMAP are using a 3x bound (which may be a bit low under the
> circumstances):
>
> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>
> so I deduce that the tasks when first issued had a runtime estimate of ~2
> hours.
>
> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over
> half an hour to complete: two hours for an ARM device sounds suspiciously
> low. The only one of my Android wingmates to have registered an APR
> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is
> showing 1.69 GFLOPS, but I have no way of knowing whether that APR was
> established before or after the task in question errored out.
>
> From experience - borne out by current tests at Albert@Home, where server
> logs are helpfully exposed to the public - initial server estimates can be
> hopelessly over-optimistic.
> These two are for the same machine:
>
> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301)
> adjusting projected flops based on PFC avg: 2124.60G
> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301)
> setting projected flops based on host elapsed time avg: 23.71G
>
> Since SIMAP have recently announced that they are leaving the BOINC platform
> at the end of the year (despite being an Android launch partner with
> Samsung), I doubt they'll want to put much effort into researching this issue.
>
> But if other projects experimenting with Android applications are
> experiencing a high task failure rate, they might like to check whether
> EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if
> so, consider the other remediation approaches (apart from outliers, which
> isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
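The SIMAP arithmetic in the quoted message can be reproduced with a short sketch. It assumes the simplest possible model, runtime = operation count / speed, and ignores any scaling the server or client may apply on top of that. The helper name `runtime_seconds` is hypothetical; the constants are the figures quoted in the message.

```python
# Values quoted in the message above.
FPOPS_EST = 13_500_000_000_000.0    # <rsc_fpops_est>
FPOPS_BOUND = 40_500_000_000_000.0  # <rsc_fpops_bound>
ABORT_AFTER_S = 6 * 3600.0          # tasks aborted after ~6 hours elapsed
I5_APR_GFLOPS = 7.34                # APR of the Intel i5 'Haswell' host
ANDROID_APR_GFLOPS = 1.69           # APR of the one Android wingmate with an APR
PFC_GFLOPS = 2124.60                # "based on PFC avg" (Albert log)
ELAPSED_GFLOPS = 23.71              # "based on host elapsed time avg" (Albert log)


def runtime_seconds(fpops: float, gflops: float) -> float:
    """Assumed model: runtime is the operation count divided by the speed."""
    return fpops / (gflops * 1e9)


# Ratio of the bound to the estimate (the "3x bound").
bound_factor = FPOPS_BOUND / FPOPS_EST
# Speed the server must have assumed if the bound expires after ~6 hours.
implied_gflops = FPOPS_BOUND / ABORT_AFTER_S / 1e9
# Initial runtime estimate at that implied speed.
initial_estimate_h = runtime_seconds(FPOPS_EST, implied_gflops) / 3600.0
# Runtimes the same model predicts for the two hosts with a known APR.
i5_minutes = runtime_seconds(FPOPS_EST, I5_APR_GFLOPS) / 60.0
android_hours = runtime_seconds(FPOPS_EST, ANDROID_APR_GFLOPS) / 3600.0
# How far apart the two Albert speed figures for the same machine are.
pfc_vs_elapsed = PFC_GFLOPS / ELAPSED_GFLOPS

print(f"bound factor: {bound_factor:.1f}x")
print(f"implied server speed: {implied_gflops:.3f} GFLOPS")
print(f"initial estimate: {initial_estimate_h:.1f} h")
print(f"i5 runtime at APR: {i5_minutes:.1f} min")
print(f"Android runtime at APR: {android_hours:.2f} h")
print(f"PFC avg / elapsed-time avg: {pfc_vs_elapsed:.1f}x")
```

On this model the 3x bound and the ~6 hour abort imply an initial estimate of about 2 hours, matching the deduction in the message, and the i5 figure comes out just over half an hour. The two Albert speed figures for the same machine differ by a factor of roughly 90.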
