The 'projected flops based on PFC avg' reports come from the 'Albert' test
server, which was updated within the last 7 days to the current BOINC master
server code. Today is a public holiday in Germany, but I'm sure Bernd will help
with investigating this issue when time permits.



>________________________________
> From: David Anderson <[email protected]>
>To: [email protected] 
>Sent: Monday, 9 June 2014, 2:54
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
>please read)
> 
>
>This is an Einstein-specific issue
>(at least, I can't debug it without looking at their server).
>-- David
>
>On 07-Jun-2014 3:44 AM, Stephen Maclagan wrote:
>> And my HD7770 is getting the following at Albert because it hasn't finished 
>> its 11 validations for its app_version yet:
>>
>> 2014-06-05 09:56:29.7913 [PID=7201 ]    [version] looking for version of einsteinbinary_BRP4G
>> 2014-06-05 09:56:29.7913 [PID=7201 ]    [version] Checking plan class 'BRP4G-opencl-ati'
>> 2014-06-05 09:56:29.7913 [PID=7201 ]    [version] plan_class_spec: parsed project prefs setting 'gpu_util_brp' : true : 1.000000
>> 2014-06-05 09:56:29.7913 [PID=7201 ]    [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
>> 2014-06-05 09:56:29.7913 [PID=7201 ]    [version] Best app version is now AV721 (18620.28 GFLOP)
>> 2014-06-05 09:56:29.7913 [PID=7201 ]    [version] [AV#721] (BRP4G-opencl-ati) adjusting projected flops based on PFC avg: 34968.78G
>> 2014-06-05 09:56:29.7914 [PID=7201 ]    [version] Best version of app einsteinbinary_BRP4G is [AV#721] (34968.78 GFLOPS)
>>
>> 2014-06-05 09:56:29.7974 [PID=7201 ]    [send] Sending app_version einsteinbinary_BRP4G 7 134 BRP4G-opencl-ati; projected 34968.78 GFLOPS
>> 2014-06-05 09:56:29.7976 [PID=7201 ]    [send] est. duration for WU 606407: unscaled 8.01 scaled 10.96
>> 2014-06-05 09:56:29.7976 [PID=7201 ]    [send] [HOST#8143] sending [RESULT#1454943 p2030.20131124.G176.16-01.04.S.b2s0g0.00000_3616_1] (est. dur. 10.96s (0h00m10s95)) (max time 160.14s (0h02m40s14))
>>
>> Real duration is going to be something like an hour, and not the 11 seconds 
>> it expects it to be done in!!
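
A rough check of the numbers in the log above may help here. The sketch
below assumes the unscaled estimate is rsc_fpops_est divided by the
projected flops, and that the max time comes from rsc_fpops_bound in the
same way; neither is confirmed by the log itself. The one-hour real
duration is only the approximate figure quoted above, and all variable
names are illustrative.

```python
# Back-of-envelope check of the scheduler figures quoted in the log.
# Assumption: est. duration = rsc_fpops_est / projected_flops, and the
# max time is derived from rsc_fpops_bound with the same projected_flops.

PROJECTED_FLOPS = 34968.78e9  # "projected 34968.78 GFLOPS"
UNSCALED_EST_S = 8.01         # "unscaled 8.01"
MAX_TIME_S = 160.14           # "max time 160.14s"
REAL_DURATION_S = 3600.0      # "something like an hour" (approximate)

# Work estimate implied by the unscaled duration and projected speed.
implied_fpops_est = UNSCALED_EST_S * PROJECTED_FLOPS

# How many times larger the time limit is than the unscaled estimate.
bound_ratio = MAX_TIME_S / UNSCALED_EST_S

# Speed the device would actually be delivering if the task takes an hour.
implied_real_flops = implied_fpops_est / REAL_DURATION_S

# A task that really needs an hour cannot finish inside the max time.
exceeds_limit = REAL_DURATION_S > MAX_TIME_S

print(f"implied rsc_fpops_est : {implied_fpops_est:.3e}")
print(f"max time / estimate   : {bound_ratio:.2f}")
print(f"implied real speed    : {implied_real_flops / 1e9:.1f} GFLOPS")
print(f"exceeds time limit    : {exceeds_limit}")
```

Under those assumptions the time limit is about 20 times the estimate, so
a task that needs roughly an hour would run past the limit long before it
completes.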
>>
>> https://albert.phys.uwm.edu/results.php?hostid=8143&offset=0&show_names=0&state=5&appid=29
>>
>> Claggy
>>
>>
>>
>>> Date: Sat, 7 Jun 2014 10:51:16 +0100
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again,    but 
>>> please read)
>>>
>>> And bad form, with two separate issues to report. Sorry again.
>>>
>>> 1) Use of outlier detection to avoid skewed averages
>>> 2) Initial runtime estimates on the Android platform
>>>
>>> 1) Outlier detection.
>>>
>>> This arises from the recent introduction of a new app_version at the 
>>> LHCclassic project. LHC, by its very nature, is searching for the onset of 
>>> chaotic orbital behaviour in the simulated particle beam: they expect, and 
>>> actively want, many tasks to finish early.
>>>
>>> Eric Mcintosh commented in a recent 'lessons learned' news item - 
>>> http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3838 - that 
>>> EXIT_TIME_LIMIT_EXCEEDED was his #1 problem following the new version 
>>> release. I've advised accordingly in that thread.
>>>
>>> But I was surprised to find that outlier detection - an appropriate 
>>> solution to this particular case - wasn't documented in the developer Wiki: 
>>> a trac/wiki search only returns a single hit for 'outlier', and that's in 
>>> http://boinc.berkeley.edu/trac/wiki/ServerUpdates - which we seem to have 
>>> stopped updating. The one-line summary doesn't give much of a clue about 
>>> when and why this feature might be useful, and without a git translation 
>>> the SVN reference doesn't help either.
>>>
>>> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=e49f9459080b488f85fbcf8cdad6db9672416cf8
>>>
>>>
>>> 2) Android runtime estimates
>>>
>>> The example here is from SIMAP. During a recent pause between batches, I 
>>> noticed that some of my 'pending validation' tasks were being slow to 
>>> clear: http://boincsimap.org/boincsimap/results.php?hostid=349248
>>>
>>> The clearest example is the third of those three workunits: 
>>> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>>
>>> Four of the seven replications have failed with 'Error while computing', 
>>> and every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android 
>>> device.
>>>
>>> Three of the four hosts have never returned a valid result (total credit 
>>> zero), so they have never had a chance to establish an APR for use in 
>>> runtime estimation: runtime estimates and bounds must have been generated 
>>> by the server.
>>>
>>> It seems - from these results, and others I've found pending on other 
>>> machines - that SIMAP tasks on Android are aborted with 
>>> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch released 
>>> today, SIMAP are using a 3x bound (which may be a bit low under the 
>>> circumstances):
>>>
>>>      <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>>      <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>>
>>> so I deduce that the tasks when first issued had a runtime estimate of ~2 
>>> hours.
>>>
>>> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over 
>>> half an hour to complete: two hours for an ARM device sounds suspiciously 
>>> low. The only one of my Android wingmates to have registered an APR 
>>> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is 
>>> showing 1.69 GFLOPS, but I have no way of knowing whether that APR was 
>>> established before or after the task in question errored out.
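
The arithmetic behind the figures above can be sketched as follows. It
assumes that runtime is simply fpops divided by the device speed, that
the ~6 hour abort corresponds to rsc_fpops_bound, and that the values
from the new batch also applied to the earlier tasks; none of this is
confirmed by SIMAP. The helper name seconds_at is made up for
illustration.

```python
# Worked arithmetic for the SIMAP estimate and bound quoted above.
# Assumption: runtime = fpops / speed, with speed given in GFLOPS.

RSC_FPOPS_EST = 13_500_000_000_000.0
RSC_FPOPS_BOUND = 40_500_000_000_000.0


def seconds_at(fpops, gflops):
    """Runtime in seconds for `fpops` operations at `gflops` GFLOPS."""
    return fpops / (gflops * 1e9)


# The bound is three times the estimate.
bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST

# Tasks aborted after ~6 hours imply an initial estimate of ~2 hours.
observed_abort_s = 6 * 3600
implied_est_s = observed_abort_s / bound_ratio

# Speed the server must have assumed to arrive at that estimate.
implied_gflops = RSC_FPOPS_EST / implied_est_s / 1e9

# Cross-checks against the two APR figures mentioned in the message.
haswell_s = seconds_at(RSC_FPOPS_EST, 7.34)
android_s = seconds_at(RSC_FPOPS_EST, 1.69)
android_limit_s = seconds_at(RSC_FPOPS_BOUND, 1.69)

print(f"bound / estimate        : {bound_ratio:.1f}x")
print(f"implied initial estimate: {implied_est_s / 3600:.2f} h")
print(f"implied assumed speed   : {implied_gflops:.3f} GFLOPS")
print(f"i5 at 7.34 GFLOPS       : {haswell_s / 60:.1f} min")
print(f"Android at 1.69 GFLOPS  : {android_s / 3600:.2f} h")
print(f"limit at 1.69 GFLOPS    : {android_limit_s / 3600:.2f} h")
```

On these assumptions the i5 figure comes out at about half an hour, which
is consistent with the runtime reported above, and the implied initial
speed for the Android hosts is slightly above the one registered APR.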
>>>
>>> From experience - borne out by current tests at Albert@Home, where server 
>>> logs are helpfully exposed to the public - initial server estimates can be 
>>> hopelessly over-optimistic. These two are for the same machine:
>>>
>>> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>>> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
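
To put a number on the gap between those two log lines, here is a small
sketch that pulls the projected flops out of each line and compares them.
The regular expression is written only against the two lines quoted above
and is not a general parser for the scheduler log; the function name is
illustrative.

```python
import re

# The two scheduler log lines quoted above, each joined onto one line.
LOG_LINES = [
    "2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] "
    "(BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G",
    "2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] "
    "(BRP4G-cuda32-nv301) setting projected flops based on host elapsed time "
    "avg: 23.71G",
]

# Captures the basis of the projection and its value in GFLOPS.
PATTERN = re.compile(
    r"projected flops based on (?P<basis>.+?): (?P<gflops>[\d.]+)G$"
)


def projected_gflops(line):
    """Return (basis, GFLOPS) for one 'projected flops' log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a projected-flops line: {line!r}")
    return match.group("basis"), float(match.group("gflops"))


estimates = dict(projected_gflops(line) for line in LOG_LINES)

# How far the initial PFC-based figure overshoots the measured one.
ratio = estimates["PFC avg"] / estimates["host elapsed time avg"]

print(estimates)
print(f"initial projection is about {ratio:.0f}x the measured figure")
```

For this host the initial projection is nearly 90 times the figure later
derived from elapsed time, so an initial runtime estimate based on it
would be too short by roughly the same factor.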
>>>
>>> Since SIMAP have recently announced that they are leaving the BOINC 
>>> platform at the end of the year (despite being an Android launch partner 
>>> with Samsung), I doubt they'll want to put much effort into researching 
>>> this issue.
>>>
>>> But if other projects experimenting with Android applications are 
>>> experiencing a high task failure rate, they might like to check whether 
>>> EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if 
>>> so, consider the other remediation approaches (apart from outlier 
>>> detection, which isn't relevant in this case) that I suggested to Eric 
>>> Mcintosh at LHC.
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected]
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.