http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea




>________________________________
> From: Josef W. Segur <[email protected]>
>To: David Anderson <[email protected]> 
>Cc: "[email protected]" <[email protected]>; Eric J Korpela 
><[email protected]>; Richard Haselgrove <[email protected]> 
>Sent: Tuesday, 10 June 2014, 2:19
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
>please read)
> 
>
>Consider Richard's observation:
>
>>>     It appears that the Android Whetstone benchmark used in the BOINC client has
>>>     separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>>>     will report that it is significantly faster than a plain-vanilla ARM.
>
>If that is so, it distinctly differs from the x86 Whetstone which never uses 
>SIMD, and is truly conservative as you would want for 3).
>-- 
>                                                                Joe
>
>
>
>On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> 
>wrote:
>
>> Eric:
>>
>> Yes, I suspect that's what's going on.
>> Currently the logic for estimating job runtime
>> (estimate_flops() in sched_version.cpp) is
>> 1) if this (host, app version) has > 10 results, use (host, app version) 
>> statistics
>> 2) if this app version has > 100 results, use app version statistics
>> 3) else use a conservative estimate based on p_fpops.
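
A minimal sketch of the three-step fallback described above, written in Python rather than the scheduler's C++. It is not the code in sched_version.cpp: the function name, parameter names, and the conservative_scale knob are hypothetical. Only the order of the steps and the two thresholds (more than 10 and more than 100 results) come from the message.

```python
from typing import Optional, Tuple

HOST_MIN_RESULTS = 10      # threshold from step 1 of the message
VERSION_MIN_RESULTS = 100  # threshold from step 2 of the message


def estimate_flops_sketch(
    host_av_results: int,
    host_av_flops: Optional[float],
    av_results: int,
    av_flops: Optional[float],
    p_fpops: float,
    conservative_scale: float = 1.0,  # hypothetical knob, not from the source
) -> Tuple[float, str]:
    """Return (projected FLOPS, name of the rule that produced it)."""
    # 1) enough results for this (host, app version): use its own statistics
    if host_av_results > HOST_MIN_RESULTS and host_av_flops:
        return host_av_flops, "host_app_version"
    # 2) enough results for the app version as a whole: use those statistics
    if av_results > VERSION_MIN_RESULTS and av_flops:
        return av_flops, "app_version"
    # 3) otherwise fall back to the host's benchmark figure
    return p_fpops * conservative_scale, "p_fpops"


# A brand-new host (no results yet) on an app version with 500 results
# inherits the app-version average, whatever its own benchmark says.
# All numbers here are made up for illustration.
print(estimate_flops_sketch(0, None, 500, 4.0e9, 1.0e9))
```

Dropping step 2, as proposed, would make the new host in this example fall through to its own p_fpops figure instead.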
>>
>> I'm not sure we should be doing 2) at all,
>> since, as you point out, the first 100 or 1000 results for an app version
>> will generally be from the fastest devices
>> (and even in the steady state,
>> app version statistics disproportionately reflect fast devices).
>>
>> I'll make this change.
>>
>> -- David
>>
>> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>>> I don't have direct access to the server either, so I'm mostly guessing.
>>> Having separate benchmarks for neon and VFP means there's a broad bimodal
>>> distribution for the benchmark results.  Where the mean falls depends upon 
>>> the mix
>>> of machines.  In general the neon machines (being newer and faster) will 
>>> report
>>> first and more often, so early on the PFC distribution will reflect the fast
>>> machines.  Slower machines will be underweighted.  So the work will be 
>>> estimated to
>>> complete quickly, and some machines will time out.  In SETI beta, it 
>>> resolves itself
>>> in a few weeks.  I can't guarantee that it will anywhere else.
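
A toy model of the effect described above, with entirely made-up numbers. Two classes of device start the same job at the same time, and the average speed is taken over whichever results have come back so far. The device counts, speeds, and job size are hypothetical, and the real server works from PFC statistics rather than a plain average of speeds; the sketch only shows the direction of the bias.

```python
def early_average_gflops(devices, job_gflop, elapsed_seconds):
    """Average speed over devices whose first result is back by elapsed_seconds.

    devices: list of (count, gflops) pairs; every device starts one job at t=0.
    Returns None if nothing has been reported yet.
    """
    reported = [(n, s) for n, s in devices if job_gflop / s <= elapsed_seconds]
    total = sum(n for n, _ in reported)
    if total == 0:
        return None
    return sum(n * s for n, s in reported) / total


# Hypothetical population: 20 fast (NEON-class) and 80 slow devices.
population = [(20, 4.0), (80, 1.0)]
JOB_GFLOP = 14_400.0  # made-up job size: 1 hour on a fast device, 4 on a slow one

after_2h = early_average_gflops(population, JOB_GFLOP, 2 * 3600)
after_5h = early_average_gflops(population, JOB_GFLOP, 5 * 3600)
print(after_2h, after_5h)
```

In this toy case the early average reflects only the fast devices, so a slow device would be given an estimate a quarter of its real runtime until the slow results arrive and pull the average down.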
>>>
>>> We see this with every release of a GPU app.  The real capabilities of graphics
>>> cards vary by orders of magnitude from the estimate and by more from each other.
>>> The fast cards report first and most everyone else hits days of timeouts.
>>>
>>> One possible fix is to increase the timeout limits for the first 10 workunits for a
>>> host_app_version, until host based estimates take over.
>>>
>>>
>>>
>>>
>>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
>>>
>>>     I think Eric Korpela would be the best person to answer that question, but I
>>>     suspect 'probably not': further investigation over the weekend suggests that the
>>>     circumstances may be SIMAP-specific.
>>>
>>>     It appears that the Android Whetstone benchmark used in the BOINC client has
>>>     separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>>>     will report that it is significantly faster than a plain-vanilla ARM.
>>>
>>>     However, SIMAP have only deployed a single Android app, which I'm assuming only
>>>     uses ARM functions: devices with vfp or NEON SIMD vectorisation available would
>>>     run the non-optimised application much slower than BOINC expects.
>>>
>>>     At my suggestion, Thomas Rattei (SIMAP administrator) increased the
>>>     rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
>>>     runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
>>>     already increased from 11 hours to 14 hours since he did that.
>>>
>>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
>>>     due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
>>>     using their phone regularly for computing." He's going to monitor how this
>>>     progresses during the remainder of the current batch, and I've asked him to keep
>>>     us updated on his observations.
>>>
>>>
>>>
>>>      >________________________________
>>>      > From: David Anderson <[email protected]>
>>>      >To: [email protected]
>>>      >Sent: Monday, 9 June 2014, 3:48
>>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me 
>>>again, but
>>>     please read)
>>>      >
>>>      >
>>>      >Does this problem occur on SETI@home?
>>>      >-- David
>>>      >
>>>      >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>>>      >
>>>      >> 2) Android runtime estimates
>>>      >>
>>>      >> The example here is from SIMAP. During a recent pause between 
>>>batches, I noticed
>>>      >> that some of my 'pending validation' tasks were being slow to clear:
>>>      >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>>>      >>
>>>      >> The clearest example is the third of those three workunits:
>>>      >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>>      >>
>>>      >> Four of the seven replications have failed with 'Error while 
>>>computing', and
>>>      >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an 
>>>Android device.
>>>      >>
>>>      >> Three of the four hosts have never returned a valid result (total 
>>>credit zero),
>>>      >> so they have never had a chance to establish an APR for use in 
>>>runtime
>>>      >> estimation: runtime estimates and bounds must have been generated 
>>>by the server.
>>>      >>
>>>      >> It seems - from these results, and others I've found pending on 
>>>other machines -
>>>      >> that SIMAP tasks on Android are aborted with 
>>>EXIT_TIME_LIMIT_EXCEEDED after ~6
>>>      >> hours elapsed. For the new batch released today, SIMAP are using a 
>>>3x bound
>>>      >> (which may be a bit low under the circumstances):
>>>      >>
>>>      >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>>      >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>>      >>
>>>      >> so I deduce that the tasks when first issued had a runtime estimate 
>>>of ~2 hours.
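
The deduction above can be checked with a few lines of Python. This is a sketch under two assumptions that are not stated in the thread: that the time limit is rsc_fpops_bound divided by the projected FLOPS, and that the runtime estimate is rsc_fpops_est divided by the same figure. The 6-hour limit is the approximate value reported above, and the function name is made up for this example.

```python
RSC_FPOPS_EST = 13_500_000_000_000.0    # from the workunit quoted above
RSC_FPOPS_BOUND = 40_500_000_000_000.0  # from the workunit quoted above
OBSERVED_LIMIT_HOURS = 6.0              # approximate abort time reported above


def implied_estimate(fpops_est, fpops_bound, limit_hours):
    """Back out the projected FLOPS and runtime estimate from an observed limit."""
    limit_seconds = limit_hours * 3600.0
    # Assumption: time limit = fpops_bound / projected FLOPS
    projected_flops = fpops_bound / limit_seconds
    # Assumption: runtime estimate = fpops_est / projected FLOPS
    estimate_hours = fpops_est / projected_flops / 3600.0
    return fpops_bound / fpops_est, projected_flops, estimate_hours


ratio, flops, hours = implied_estimate(
    RSC_FPOPS_EST, RSC_FPOPS_BOUND, OBSERVED_LIMIT_HOURS
)
print(f"bound/est = {ratio:.1f}x")
print(f"implied projected speed = {flops / 1e9:.3f} GFLOPS")
print(f"implied runtime estimate = {hours:.1f} hours")
```

Under those assumptions, a 6-hour limit with a 3x bound corresponds to a 2-hour estimate and a projected speed of about 1.9 GFLOPS.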
>>>      >>
>>>      >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over half
>>>      >> an hour to complete: two hours for an ARM device sounds suspiciously low. The
>>>      >> only one of my Android wingmates to have registered an APR
>>>      >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is showing
>>>      >> 1.69 GFLOPS, but I have no way of knowing whether that APR was established before
>>>      >> or after the task in question errored out.
>>>      >>
>>>      >> From experience - borne out by current tests at Albert@Home, where 
>>>server logs
>>>      >> are helpfully exposed to the public - initial server estimates can 
>>>be hopelessly
>>>      >> over-optimistic. These two are for the same machine:
>>>      >>
>>>      >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301)
>>>      >> adjusting projected flops based on PFC avg: 2124.60G
>>>      >>
>>>      >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301)
>>>      >> setting projected flops based on host elapsed time avg: 23.71G
>>>      >>
>>>      >> Since SIMAP have recently announced that they are leaving the BOINC 
>>>platform at
>>>      >> the end of the year (despite being an Android launch partner with 
>>>Samsung), I
>>>      >> doubt they'll want to put much effort into researching this issue.
>>>      >>
>>>      >> But if other projects experimenting with Android applications are 
>>>experiencing a
>>>      >> high task failure rate, they might like to check whether
>>>     EXIT_TIME_LIMIT_EXCEEDED
>>>      >> is a significant factor in those failures, and if so, consider the 
>>>other
>>>      >> remediation approaches (apart from outliers, which isn't relevant 
>>>in this case)
>>>      >> that I suggested to Eric Mcintosh at LHC.
>
>
>
