This is an Einstein-specific issue
(at least, I can't debug it without looking at their server).
-- David
On 07-Jun-2014 3:44 AM, Stephen Maclagan wrote:
And my HD7770 is getting the following at Albert because it hasn't finished
its 11 validations for its app_version yet:
2014-06-05 09:56:29.7913 [PID=7201 ] [version] looking for version of
einsteinbinary_BRP4G
2014-06-05 09:56:29.7913 [PID=7201 ] [version] Checking plan class
'BRP4G-opencl-ati'
2014-06-05 09:56:29.7913 [PID=7201 ] [version] plan_class_spec: parsed
project prefs setting 'gpu_util_brp' : true : 1.000000
2014-06-05 09:56:29.7913 [PID=7201 ] [version] [AV#721] (BRP4G-opencl-ati)
adjusting projected flops based on PFC avg: 34968.78G
2014-06-05 09:56:29.7913 [PID=7201 ] [version] Best app version is now
AV721 (18620.28 GFLOP)
2014-06-05 09:56:29.7913 [PID=7201 ] [version] [AV#721] (BRP4G-opencl-ati)
adjusting projected flops based on PFC avg: 34968.78G
2014-06-05 09:56:29.7914 [PID=7201 ] [version] Best version of app
einsteinbinary_BRP4G is [AV#721] (34968.78 GFLOPS)
2014-06-05 09:56:29.7974 [PID=7201 ] [send] Sending app_version
einsteinbinary_BRP4G 7 134 BRP4G-opencl-ati; projected 34968.78 GFLOPS
2014-06-05 09:56:29.7976 [PID=7201 ] [send] est. duration for WU 606407:
unscaled 8.01 scaled 10.96
2014-06-05 09:56:29.7976 [PID=7201 ] [send] [HOST#8143] sending
[RESULT#1454943 p2030.20131124.G176.16-01.04.S.b2s0g0.00000_3616_1] (est. dur.
10.96s (0h00m10s95)) (max time 160.14s (0h02m40s14))
Real duration is going to be something like an hour, and not the 11 seconds it
expects it to be done in!!
https://albert.phys.uwm.edu/results.php?hostid=8143&offset=0&show_names=0&state=5&appid=29
Claggy
Date: Sat, 7 Jun 2014 10:51:16 +0100
From: [email protected]
To: [email protected]
Subject: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but
please read)
And bad form, with two separate issues to report. Sorry again.
1) Use of outlier detection to avoid skewed averages
2) Initial runtime estimates on the Android platform
1) Outlier detection.
This arises from the recent introduction of a new app_version at the LHCclassic
project. LHC, by its very nature, is searching for the onset of chaotic orbital
behaviour in the simulated particle beam: they expect, and actively want, many
tasks to finish early.
Eric Mcintosh commented in a recent 'lessons learned' news item -
http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3838 - that
EXIT_TIME_LIMIT_EXCEEDED was his #1 problem following the new version release.
I've advised accordingly in that thread.
But I was surprised to find that outlier detection - an appropriate solution to
this particular case - wasn't documented in the developer Wiki: a trac/wiki
search only returns a single hit for 'outlier', and that's in
http://boinc.berkeley.edu/trac/wiki/ServerUpdates - which we seem to have
stopped updating. The one-line summary doesn't give much of a clue about when
and why this feature might be useful, and without a git translation the SVN
reference doesn't help either.
http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=e49f9459080b488f85fbcf8cdad6db9672416cf8
2) Android runtime estimates
The example here is from SIMAP. During a recent pause between batches, I
noticed that some of my 'pending validation' tasks were being slow to clear:
http://boincsimap.org/boincsimap/results.php?hostid=349248
The clearest example is the third of those three workunits:
http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
Four of the seven replications have failed with 'Error while computing', and
every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
Three of the four hosts have never returned a valid result (total credit zero),
so they have never had a chance to establish an APR for use in runtime
estimation: runtime estimates and bounds must have been generated by the server.
It seems - from these results, and others I've found pending on other machines
- that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after
~6 hours elapsed. For the new batch released today, SIMAP are using a 3x bound
(which may be a bit low under the circumstances):
<rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over
half an hour to complete: two hours for an ARM device sounds suspiciously low.
The only one of my Android wingmates to have registered an APR
(http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is
showing 1.69 GFLOPS, but I have no way of knowing whether that APR was
established before or after the task in question errored out.
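To make the arithmetic behind that deduction explicit, here is a rough sketch.
It assumes the simplest possible model (runtime = fpops / projected speed,
time limit = rsc_fpops_bound / projected speed) and ignores any other
correction factors the client or scheduler may apply. The variable and
function names are mine, not BOINC's.

```python
# Values quoted in the message above.
RSC_FPOPS_EST = 13_500_000_000_000.0    # <rsc_fpops_est>
RSC_FPOPS_BOUND = 40_500_000_000_000.0  # <rsc_fpops_bound>
OBSERVED_LIMIT_S = 6 * 3600             # tasks aborted after ~6 hours elapsed


def runtime_seconds(fpops, flops):
    """Assumed model: runtime = floating-point operations / speed."""
    return fpops / flops


# Speed the server must have projected for the bound to expire at ~6 hours.
implied_flops = RSC_FPOPS_BOUND / OBSERVED_LIMIT_S
# Initial runtime estimate that goes with that projected speed.
implied_estimate_s = runtime_seconds(RSC_FPOPS_EST, implied_flops)

# The same model applied to the two APR figures mentioned above.
haswell_estimate_s = runtime_seconds(RSC_FPOPS_EST, 7.34e9)  # i5 'Haswell'
android_estimate_s = runtime_seconds(RSC_FPOPS_EST, 1.69e9)  # Android wingmate
android_limit_s = runtime_seconds(RSC_FPOPS_BOUND, 1.69e9)

print(f"implied speed:     {implied_flops / 1e9:.3f} GFLOPS")
print(f"implied estimate:  {implied_estimate_s / 3600:.1f} h")
print(f"Haswell estimate:  {haswell_estimate_s / 60:.1f} min")
print(f"Android estimate:  {android_estimate_s / 3600:.1f} h")
print(f"Android limit:     {android_limit_s / 3600:.1f} h")
```

Under those assumptions the implied projected speed is about 1.9 GFLOPS and
the initial estimate about 2 hours, while the i5's APR gives roughly 31
minutes, consistent with 'over half an hour'. Even at the wingmate's 1.69
GFLOPS the limit would only be around 6.7 hours.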
From experience - borne out by current tests at Albert@Home, where server logs
are helpfully exposed to the public - initial server estimates can be
hopelessly over-optimistic. These two are for the same machine:
2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301)
adjusting projected flops based on PFC avg: 2124.60G
2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301)
setting projected flops based on host elapsed time avg: 23.71G
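As a rough illustration of what those two log lines imply, the sketch below
compares the two projected speeds. It assumes that rsc_fpops_est itself is
accurate and that the time limit is derived from the same projected speed as
the estimate, so a task overruns whenever the speed is overestimated by more
than the bound/est multiplier. The function name is hypothetical.

```python
# Projected speeds for AV#716 on the same host, from the two log lines above.
INITIAL_PFC_GFLOPS = 2124.60  # "adjusting projected flops based on PFC avg"
MEASURED_GFLOPS = 23.71       # "setting projected flops based on host elapsed time avg"

# How many times too fast the initial projection was.
overestimate_factor = INITIAL_PFC_GFLOPS / MEASURED_GFLOPS


def exceeds_time_limit(bound_multiplier, speed_overestimate):
    """Assumed model: limit = bound_multiplier * estimated runtime, and the
    real runtime = speed_overestimate * estimated runtime."""
    return speed_overestimate > bound_multiplier


print(f"initial projection is {overestimate_factor:.1f}x too fast")
for multiplier in (3.0, 10.0, 100.0):
    print(multiplier, exceeds_time_limit(multiplier, overestimate_factor))
```

On these figures the initial projection is about 90 times too fast, so under
this simple model a 3x or even 10x bound would be exceeded before the host
ever had a chance to establish an APR.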
Since SIMAP have recently announced that they are leaving the BOINC platform at
the end of the year (despite being an Android launch partner with Samsung), I
doubt they'll want to put much effort into researching this issue.
But if other projects experimenting with Android applications are experiencing
a high task failure rate, they might like to check whether
EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if so,
consider the other remediation approaches (apart from outlier detection, which
isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.