This is an Einstein-specific issue
(at least, I can't debug it without looking at their server).
-- David

On 07-Jun-2014 3:44 AM, Stephen Maclagan wrote:
And my HD7770 is getting the following at Albert because it hasn't finished 
its 11 validations for its app_version yet:

2014-06-05 09:56:29.7913 [PID=7201 ]    [version] looking for version of einsteinbinary_BRP4G
2014-06-05 09:56:29.7913 [PID=7201 ]    [version] Checking plan class 'BRP4G-opencl-ati'
2014-06-05 09:56:29.7913 [PID=7201 ]    [version] plan_class_spec: parsed project prefs setting 'gpu_util_brp' : true : 1.000000
2014-06-05 09:56:29.7913 [PID=7201 ]    [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
2014-06-05 09:56:29.7913 [PID=7201 ]    [version] Best app version is now AV721 (18620.28 GFLOP)
2014-06-05 09:56:29.7913 [PID=7201 ]    [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
2014-06-05 09:56:29.7914 [PID=7201 ]    [version] Best version of app einsteinbinary_BRP4G is [AV#721] (34968.78 GFLOPS)

2014-06-05 09:56:29.7974 [PID=7201 ]    [send] Sending app_version einsteinbinary_BRP4G 7 134 BRP4G-opencl-ati; projected 34968.78 GFLOPS
2014-06-05 09:56:29.7976 [PID=7201 ]    [send] est. duration for WU 606407: unscaled 8.01 scaled 10.96
2014-06-05 09:56:29.7976 [PID=7201 ]    [send] [HOST#8143] sending [RESULT#1454943 p2030.20131124.G176.16-01.04.S.b2s0g0.00000_3616_1] (est. dur. 10.96s (0h00m10s95)) (max time 160.14s (0h02m40s14))

Real duration is going to be something like an hour, and not the 11 seconds it 
expects it to be done in!!
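
To put rough numbers on that gap: the unscaled estimate appears to be the workunit's estimated operation count divided by the projected speed. The Python sketch below backs out the implied work size from the logged values and asks what effective speed a one-hour run would correspond to. The 3600 s figure is an assumed round number standing in for "something like an hour", not a measurement.

```python
# Values copied from the scheduler log above, except REAL_RUNTIME_S.
PROJECTED_GFLOPS = 34968.78   # "projected 34968.78 GFLOPS"
UNSCALED_EST_S = 8.01         # "unscaled 8.01"
MAX_TIME_S = 160.14           # "max time 160.14s"
REAL_RUNTIME_S = 3600.0       # assumption: "something like an hour"

# Work size implied by estimate = fpops / projected speed.
implied_fpops = PROJECTED_GFLOPS * 1e9 * UNSCALED_EST_S

# Speed the host would actually be delivering if the task takes an hour.
effective_gflops = implied_fpops / REAL_RUNTIME_S / 1e9

# How far past the time limit a one-hour run would go.
overrun = REAL_RUNTIME_S / MAX_TIME_S

print(f"implied work size: {implied_fpops:.3e} fpops")
print(f"effective speed for a 1 h run: {effective_gflops:.1f} GFLOPS")
print(f"a 1 h run is {overrun:.1f}x the max time")
```

On these assumptions the task would be aborted long before it could finish, since the max time is only a small fraction of the real runtime.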

https://albert.phys.uwm.edu/results.php?hostid=8143&offset=0&show_names=0&state=5&appid=29

Claggy



Date: Sat, 7 Jun 2014 10:51:16 +0100
From: [email protected]
To: [email protected]
Subject: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
please read)

And bad form, with two separate issues to report. Sorry again.

1) Use of outlier detection to avoid skewed averages
2) Initial runtime estimates on the Android platform

1) Outlier detection.

This arises from the recent introduction of a new app_version at the LHCclassic 
project. LHC, by its very nature, is searching for the onset of chaotic orbital 
behaviour in the simulated particle beam: they expect, and actively want, many 
tasks to finish early.

Eric Mcintosh commented in a recent 'lessons learned' news item - 
http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3838 - that 
EXIT_TIME_LIMIT_EXCEEDED was his #1 problem following the new version release. 
I've advised accordingly in that thread.

But I was surprised to find that outlier detection - an appropriate solution to 
this particular case - wasn't documented in the developer Wiki: a trac/wiki 
search only returns a single hit for 'outlier', and that's in 
http://boinc.berkeley.edu/trac/wiki/ServerUpdates - which we seem to have 
stopped updating. The one-line summary doesn't give much of a clue about when 
and why this feature might be useful, and without a git translation the SVN 
reference doesn't help either.

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=e49f9459080b488f85fbcf8cdad6db9672416cf8
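
For readers unfamiliar with why outliers matter here, a toy illustration in Python follows. It is not BOINC's server code: the runtimes, the 10% threshold, the 3x bound ratio and the function name are all invented for the example. The point is only that if tasks which legitimately exit early are averaged in with full-length runs, the average runtime drops, and a time limit derived from that average becomes too tight for a task that runs to completion.

```python
from statistics import mean

FULL_RUN_S = 3600   # hypothetical runtime of a task that runs to completion
BOUND_RATIO = 3     # hypothetical: time limit = 3x the average-based estimate

# Hypothetical elapsed times: three full runs and fifteen early exits.
runtimes = [3600, 3500, 3700] + [40, 25, 60, 30, 45] * 3


def is_outlier(runtime_s, full_length_s=FULL_RUN_S, threshold=0.1):
    """Flag runs far shorter than a full-length task (invented rule)."""
    return runtime_s < threshold * full_length_s


naive_avg = mean(runtimes)
filtered_avg = mean(r for r in runtimes if not is_outlier(r))

for label, avg in (("naive", naive_avg), ("filtered", filtered_avg)):
    limit = BOUND_RATIO * avg
    verdict = "aborted" if FULL_RUN_S > limit else "completes"
    print(f"{label}: avg {avg:.0f}s, limit {limit:.0f}s, full run {verdict}")
```

With the early exits excluded, the average reflects full-length tasks and the derived limit leaves room for them to finish.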


2) Android runtime estimates

The example here is from SIMAP. During a recent pause between batches, I 
noticed that some of my 'pending validation' tasks were being slow to clear: 
http://boincsimap.org/boincsimap/results.php?hostid=349248

The clearest example is the third of those three workunits: 
http://boincsimap.org/boincsimap/workunit.php?wuid=57169928

Four of the seven replications have failed with 'Error while computing', and 
every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.

Three of the four hosts have never returned a valid result (total credit zero), 
so they have never had a chance to establish an APR for use in runtime 
estimation: runtime estimates and bounds must have been generated by the server.

It seems - from these results, and others I've found pending on other machines 
- that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after 
~6 hours elapsed. For the new batch released today, SIMAP are using a 3x bound 
(which may be a bit low under the circumstances):

     <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
     <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>

so I deduce that the tasks, when first issued, had a runtime estimate of ~2 
hours (the ~6 hour limit divided by the 3x bound).
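
The deduction can be written out as a short Python sketch. It assumes the earlier batch used the same rsc_fpops_est and rsc_fpops_bound as the values quoted above, and that the time limit scales with the estimate by the bound/est ratio. The 6-hour figure is the approximate observed limit, so the results are approximate too.

```python
RSC_FPOPS_EST = 13_500_000_000_000.0     # from the workunit XML above
RSC_FPOPS_BOUND = 40_500_000_000_000.0   # from the workunit XML above
OBSERVED_LIMIT_S = 6 * 3600              # "~6 hours elapsed" (approximate)

# Ratio between the abort limit and the runtime estimate.
bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST

# Runtime estimate the server must have started from.
initial_estimate_s = OBSERVED_LIMIT_S / bound_ratio

# Device speed that estimate implies for this workunit size.
implied_gflops = RSC_FPOPS_EST / initial_estimate_s / 1e9

print(f"bound ratio: {bound_ratio:.1f}x")
print(f"initial estimate: {initial_estimate_s / 3600:.1f} h")
print(f"implied projected speed: {implied_gflops:.3f} GFLOPS")
```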

My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over 
half an hour to complete: two hours for an ARM device sounds suspiciously low. 
The only one of my Android wingmates to have registered an APR 
(http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is 
showing 1.69 GFLOPS, but I have no way of knowing whether that APR was 
established before or after the task in question errored out.
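
As a rough cross-check of those figures, the sketch below divides the quoted rsc_fpops_est by each APR. It assumes APR can be read as roughly "estimated operations per elapsed second", which is a simplification, and the helper name is made up for the example.

```python
RSC_FPOPS_EST = 13.5e12   # from the workunit XML quoted earlier


def runtime_hours(apr_gflops):
    """Approximate runtime in hours for one task at a given APR (GFLOPS)."""
    return RSC_FPOPS_EST / (apr_gflops * 1e9) / 3600


for label, apr in (("Intel i5 'Haswell'", 7.34), ("Android wingmate", 1.69)):
    print(f"{label}: APR {apr} GFLOPS -> {runtime_hours(apr):.2f} h")
```

The i5 figure comes out at just over half an hour, which matches the runtimes reported above.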

From experience - borne out by current tests at Albert@Home, where server logs 
are helpfully exposed to the public - initial server estimates can be 
hopelessly over-optimistic. These two are for the same machine:

2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
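
The size of that discrepancy, and what it means for the time limit, can be sketched in Python. The two speeds are taken from the log lines above. The sketch assumes the elapsed-time figure reflects the host's real speed, and the bound ratios tried at the end are illustrative inputs (3 is the SIMAP ratio quoted earlier, the others are arbitrary).

```python
PFC_GFLOPS = 2124.60       # initial projection, from the first log line
ELAPSED_GFLOPS = 23.71     # later projection, from the second log line

# Factor by which the initial runtime estimate was too short.
overestimate = PFC_GFLOPS / ELAPSED_GFLOPS


def exceeds_limit(bound_ratio):
    """True if a task would overrun a limit of bound_ratio x the estimate."""
    return overestimate > bound_ratio


print(f"initial projection is {overestimate:.1f}x the measured speed")
for ratio in (3, 20, 100):
    print(f"bound ratio {ratio}x: exceeds limit = {exceeds_limit(ratio)}")
```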

Since SIMAP have recently announced that they are leaving the BOINC platform at 
the end of the year (despite being an Android launch partner with Samsung), I 
doubt they'll want to put much effort into researching this issue.

But if other projects experimenting with Android applications are experiencing 
a high task failure rate, they might like to check whether 
EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if so, 
consider the other remediation approaches (apart from outlier detection, which 
isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
                                        