Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

Josef W. Segur Mon, 09 Jun 2014 18:20:23 -0700

Consider Richard's observation:

    It appears that the Android Whetstone benchmark used in the BOINC client has
    separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
    will report that it is significantly faster than a plain-vanilla ARM.


If that is so, it distinctly differs from the x86 Whetstone which never uses 
SIMD, and is truly conservative as you would want for 3).
--
                                                               Joe



On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:

Eric:

Yes, I suspect that's what's going on.
Currently the logic for estimating job runtime
(estimate_flops() in sched_version.cpp) is
1) if this (host, app version) has > 10 results, use (host, app version) statistics
2) if this app version has > 100 results, use app version statistics
3) else use a conservative estimate based on p_fpops.
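
A minimal sketch of the three-step fallback listed above, for readers who
want to see the order of the checks. This is not the code in
sched_version.cpp: the names Stats, projected_flops, estimated_runtime and
conservative_scale are hypothetical, and the real scaling applied to p_fpops
in step 3 is not given in this thread.

```python
from dataclasses import dataclass

# Thresholds quoted in the message; every name below is illustrative only.
MIN_HOST_SAMPLES = 10
MIN_VERSION_SAMPLES = 100


@dataclass
class Stats:
    n_results: int     # number of results behind the average
    flops_avg: float   # average observed speed, in FLOPS


def projected_flops(host_av: Stats, app_version: Stats,
                    p_fpops: float, conservative_scale: float = 1.0) -> float:
    """Choose a speed estimate using the three-step fallback."""
    if host_av.n_results > MIN_HOST_SAMPLES:
        return host_av.flops_avg            # 1) (host, app version) statistics
    if app_version.n_results > MIN_VERSION_SAMPLES:
        return app_version.flops_avg        # 2) app version statistics
    # 3) benchmark-based fallback; conservative_scale is a placeholder
    return p_fpops * conservative_scale


def estimated_runtime(rsc_fpops_est: float, flops: float) -> float:
    """Estimated runtime in seconds for a job of rsc_fpops_est operations."""
    return rsc_fpops_est / flops
```

With made-up numbers, a new host (no results of its own) falls through to
step 2 as soon as the app version has more than 100 results, so it inherits
whatever speed the early reporters established.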

I'm not sure we should be doing 2) at all,
since as you point out the first 100 or 1000 results for an app version
will generally be from the fastest devices
(and even in the steady state,
app version statistics disproportionately reflect fast devices).

I'll make this change.

-- David

On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:

I don't have direct access to the server either, so I'm mostly guessing.
Having separate benchmarks for neon and VFP means there's a broad bimodal
distribution for the benchmark results.  Where the mean falls depends upon the
mix of machines.  In general the neon machines (being newer and faster) will
report first and more often, so early on the PFC distribution will reflect the
fast machines.  Slower machines will be underweighted.  So the work will be
estimated to complete quickly, and some machines will time out.  In SETI beta,
it resolves itself in a few weeks.  I can't guarantee that it will anywhere else.
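
To make the weighting effect concrete, here is a small sketch with invented
numbers (two host classes of equal size, one four times faster than the
other). The function name and all figures are assumptions for illustration,
not measurements from any project.

```python
def result_weighted_mean_speed(classes, elapsed_hours):
    """Mean speed over returned results, not over hosts.

    classes: list of (n_hosts, speed_gflops, hours_per_task) tuples.
    Each host is assumed to crunch tasks back to back from time zero.
    Returns None if no host has returned a result yet.
    """
    total_results = 0
    weighted = 0.0
    for n_hosts, speed, hours_per_task in classes:
        results = n_hosts * int(elapsed_hours // hours_per_task)
        total_results += results
        weighted += results * speed
    return weighted / total_results if total_results else None


# Hypothetical population: 100 fast hosts (4 GFLOPS, 2 h per task)
# and 100 slow hosts (1 GFLOPS, 8 h per task).
hosts = [(100, 4.0, 2.0), (100, 1.0, 8.0)]
early = result_weighted_mean_speed(hosts, 6.0)    # only fast hosts have reported
later = result_weighted_mean_speed(hosts, 80.0)   # both classes are reporting
per_host_mean = (100 * 4.0 + 100 * 1.0) / 200     # 2.5 GFLOPS
```

In this toy case the early average is the fast hosts' speed alone, and even
after 80 hours the result-weighted average stays above the per-host mean,
because fast hosts keep returning more results per unit time.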

We see this with every release of a GPU app.  The real capabilities of graphics
cards vary by orders of magnitude from the estimate and by more from each other.
The fast cards report first and most everyone else hits days of timeouts.

One possible fix is to increase the timeout limits for the first 10 workunits
for a host_app_version, until host based estimates take over.
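
A rough sketch of what that fix could look like. The function fpops_bound,
its parameters, and the 10x warm-up multiplier are all hypothetical; nothing
here is taken from the BOINC scheduler source.

```python
def fpops_bound(rsc_fpops_est, base_multiplier, host_av_results,
                warmup_results=10, warmup_multiplier=10.0):
    """Return a FLOP bound that is more generous while a host is unproven.

    host_av_results is the number of results this host_app_version has
    returned so far. Until it reaches warmup_results, the larger of the
    project's multiplier and the warm-up multiplier is used.
    """
    multiplier = base_multiplier
    if host_av_results < warmup_results:
        multiplier = max(base_multiplier, warmup_multiplier)
    return rsc_fpops_est * multiplier
```

For a job with rsc_fpops_est of 13.5e12 and a project multiplier of 3, a host
with no history would get a bound of 135e12, and a host with 10 or more
results would get the usual 40.5e12.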




On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:

    I think Eric Korpela would be the best person to answer that question, but I
    suspect 'probably not': further investigation over the weekend suggests that
    the circumstances may be SIMAP-specific.

    It appears that the Android Whetstone benchmark used in the BOINC client has
    separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
    will report that it is significantly faster than a plain-vanilla ARM.

    However, SIMAP have only deployed a single Android app, which I'm assuming
    only uses ARM functions: devices with vfp or NEON SIMD vectorisation available
    would run the non-optimised application much slower than BOINC expects.

    At my suggestion, Thomas Rattei (SIMAP administrator) increased the
    rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
    runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
    already increased from 11 hours to 14 hours since he did that.

    Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
    due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
    using their phone regularly for computing." He's going to monitor how this
    progresses during the remainder of the current batch, and I've asked him to
    keep us updated on his observations.



     >________________________________
     > From: David Anderson <[email protected]>
     >To: [email protected]
     >Sent: Monday, 9 June 2014, 3:48
     >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
     >
     >
     >Does this problem occur on SETI@home?
     >-- David
     >
     >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
     >
     >> 2) Android runtime estimates
     >>
     >> The example here is from SIMAP. During a recent pause between batches,
     >> I noticed that some of my 'pending validation' tasks were being slow to
     >> clear:
     >> http://boincsimap.org/boincsimap/results.php?hostid=349248
     >>
     >> The clearest example is the third of those three workunits:
     >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
     >>
     >> Four of the seven replications have failed with 'Error while computing',
     >> and every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android
     >> device.
     >>
     >> Three of the four hosts have never returned a valid result (total credit
     >> zero), so they have never had a chance to establish an APR for use in
     >> runtime estimation: runtime estimates and bounds must have been generated
     >> by the server.
     >>
     >> It seems - from these results, and others I've found pending on other
     >> machines - that SIMAP tasks on Android are aborted with
     >> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch
     >> released today, SIMAP are using a 3x bound (which may be a bit low under
     >> the circumstances):
     >>
     >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
     >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
     >>
     >> so I deduce that the tasks when first issued had a runtime estimate of
     >> ~2 hours.
     >>
     >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take
     >> over half an hour to complete: two hours for an ARM device sounds
     >> suspiciously low. The only one of my Android wingmates to have registered
     >> an APR (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033)
     >> is showing 1.69 GFLOPS, but I have no way of knowing whether that APR was
     >> established before or after the task in question errored out.
     >>
     >> From experience - borne out by current tests at Albert@Home, where server
     >> logs are helpfully exposed to the public - initial server estimates can be
     >> hopelessly over-optimistic. These two are for the same machine:
     >>
     >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
     >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
     >>
     >> Since SIMAP have recently announced that they are leaving the BOINC
     >> platform at the end of the year (despite being an Android launch partner
     >> with Samsung), I doubt they'll want to put much effort into researching
     >> this issue.
     >>
     >> But if other projects experimenting with Android applications are
     >> experiencing a high task failure rate, they might like to check whether
     >> EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if
     >> so, consider the other remediation approaches (apart from outliers, which
     >> isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.
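
The arithmetic behind the quoted deduction can be checked with a short
sketch. It assumes, as the quoted reasoning implies, that a task is aborted
once elapsed time exceeds rsc_fpops_bound divided by the projected speed.
The helper names are hypothetical, and the 6-hour figure is the approximate
value from the message.

```python
RSC_FPOPS_EST = 13_500_000_000_000.0     # from the quoted workunit XML
RSC_FPOPS_BOUND = 40_500_000_000_000.0   # 3x the estimate


def implied_flops(fpops, seconds):
    """Speed that would make `fpops` operations take `seconds` seconds."""
    return fpops / seconds


def runtime_hours(fpops, flops):
    """Hours needed for `fpops` operations at `flops` operations per second."""
    return fpops / flops / 3600.0


# Tasks were aborted after roughly 6 hours, so the projected speed was about:
server_flops = implied_flops(RSC_FPOPS_BOUND, 6 * 3600)     # 1.875e9 FLOPS
initial_estimate_h = runtime_hours(RSC_FPOPS_EST, server_flops)

# Same job at the two APR values quoted in the message.
i5_hours = runtime_hours(RSC_FPOPS_EST, 7.34e9)        # Intel i5 'Haswell'
android_hours = runtime_hours(RSC_FPOPS_EST, 1.69e9)   # Android wingmate

# Ratio between the two projected-flops values in the Albert@Home log lines.
overestimate = 2124.60 / 23.71
```

Under those assumptions the initial estimate works out to 2 hours at an
implied 1.875 GFLOPS, the i5 figure to a little over half an hour, and the
first Albert@Home estimate to roughly 90 times the later host-based one.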

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
