Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

David Anderson Tue, 10 Jun 2014 11:35:25 -0700

For credit purposes, the standard is peak FLOPS,
i.e. we give credit for what the device could do,
rather than what it actually did.
Among other things, this encourages projects to develop more efficient apps.


Currently we're not measuring this well for x86 CPUs,
since our Whetstone benchmark isn't optimized.
Ideally the BOINC client should include variants for the most common
CPU features, as we do for ARM.

-- D

On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:

Before anybody leaps into making any changes on the basis of that observation, I
think we ought to pause and consider why we have a benchmark, and what we use
it for.

I'd suggest that in an ideal world, we would be measuring the actual running
speed of (each project's) science applications on that particular host,
optimisations and all. We gradually do this through the runtime averages anyway,
but it's hard to gather a priori data on a new host.

Instead of (initially) measuring science application performance, we measure
hardware performance as a surrogate. We now have (at least) three ways of doing
that:

x86: minimum, most conservative, estimate, no optimisations allowed for.
Android: allows for optimised hardware pathways with vfp or neon, but doesn't
relate back to science app capability.
GPU: maximum theoretical 'peak flops', calculated from card parameters, then
scaled back by rule of thumb.

Maybe we should standardise on just one standard?

    
------------------------------------------------------------------------------------
    *From:* Richard Haselgrove <[email protected]>
    *To:* Josef W. Segur <[email protected]>; David Anderson
    <[email protected]>
    *Cc:* "[email protected]" <[email protected]>
    *Sent:* Tuesday, 10 June 2014, 9:37
    *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

    
http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea

     >________________________________
     > From: Josef W. Segur <[email protected]>
     >To: David Anderson <[email protected]>
     >Cc: "[email protected]" <[email protected]>; Eric J Korpela <[email protected]>; Richard Haselgrove <[email protected]>
     >Sent: Tuesday, 10 June 2014, 2:19
     >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
     >
     >
     >Consider Richard's observation:
     >
     >>>     It appears that the Android Whetstone benchmark used in the BOINC client has
     >>>     separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
     >>>     will report that it is significantly faster than a plain-vanilla ARM.
     >
     >If that is so, it distinctly differs from the x86 Whetstone which never uses
     >SIMD, and is truly conservative as you would want for 3).
     >--
     >                               Joe
     >
     >
     >
     >On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
     >
     >> Eric:
     >>
     >> Yes, I suspect that's what's going on.
     >> Currently the logic for estimating job runtime
     >> (estimate_flops() in sched_version.cpp) is
     >> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
     >> 2) if this app version has > 100 results, use app version statistics
     >> 3) else use a conservative estimate based on p_fpops.
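The three-step fallback listed above can be sketched as follows. This is only an illustration of the selection order, not the code in sched_version.cpp: the names `Stats`, `projected_flops` and `conservative_factor` are hypothetical, and the real scheduler works from PFC averages rather than a single average speed.

```python
from dataclasses import dataclass

# Thresholds taken from the message above.
MIN_HOST_VERSION_RESULTS = 10
MIN_VERSION_RESULTS = 100


@dataclass
class Stats:
    """Hypothetical container for accumulated result statistics."""
    n_results: int    # number of results seen so far
    avg_flops: float  # average measured speed, in FLOPS


def projected_flops(host_version: Stats, version: Stats,
                    p_fpops: float, conservative_factor: float = 1.0) -> float:
    """Pick a speed estimate for a job, most specific source first."""
    if host_version.n_results > MIN_HOST_VERSION_RESULTS:
        return host_version.avg_flops       # 1) (host, app version) statistics
    if version.n_results > MIN_VERSION_RESULTS:
        return version.avg_flops            # 2) app version statistics
    return p_fpops * conservative_factor    # 3) benchmark-based fallback


# A new host with no results of its own inherits the app version average,
# which may have been set by much faster machines.
new_host = Stats(n_results=0, avg_flops=0.0)
print(projected_flops(new_host, Stats(500, 9e9), p_fpops=1e9))
```

Dropping step 2, as proposed in the message, would make the new host in the usage example fall through to the benchmark-based estimate instead.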
     >>
     >> I'm not sure we should be doing 2) at all,
     >> since as you point out the first 100 or 1000 results for an app version
     >> will generally be from the fastest devices
     >> (and even in the steady state,
     >> app version statistics disproportionately reflect fast devices).
     >>
     >> I'll make this change.
     >>
     >> -- David
     >>
     >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
     >>> I also don't have direct access to the server, so I'm mostly guessing.
     >>> Having separate benchmarks for neon and VFP means there's a broad bimodal
     >>> distribution for the benchmark results.  Where the mean falls depends upon the mix
     >>> of machines.  In general the neon machines (being newer and faster) will report
     >>> first and more often, so early on the PFC distribution will reflect the fast
     >>> machines.  Slower machines will be underweighted.  So the work will be estimated to
     >>> complete quickly, and some machines will time out.  In SETI beta, it resolves itself
     >>> in a few weeks.  I can't guarantee that it will anywhere else.
     >>>
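A toy illustration of the sampling bias described above. All numbers here are made up for the example (two host classes, their speeds, and the mix of early reports), and it averages speeds directly rather than modelling the PFC statistics the server actually keeps.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical effective speeds in GFLOPS for the two host classes.
FAST, SLOW = 2.0, 0.5

# Assumed early sample: fast hosts report first and more often.
early_reports = [FAST] * 9 + [SLOW] * 1

# Speed the server would assume for a host with no history of its own.
est = mean(early_reports)

# How much longer than estimated a slow host actually needs.
slow_host_ratio = est / SLOW

# With a 3x bound on the estimate, the slow host exceeds its time limit.
BOUND_MULTIPLIER = 3.0
times_out = slow_host_ratio > BOUND_MULTIPLIER

print(est, slow_host_ratio, times_out)
```

Under these assumed numbers the slow hosts need several times the estimated runtime, so a small bound multiplier is enough to abort them until their own statistics take over.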
     >>> We see this with every release of a GPU app.  The real capabilities of graphics
     >>> cards vary by orders of magnitude from the estimate and by more from each other.
     >>> The fast cards report first and most everyone else hits days of timeouts.
     >>>
     >>> One possible fix is to increase the timeout limits for the first 10 workunits for a
     >>> host_app_version, until host based estimates take over.
     >>>
     >>>
     >>>
     >>>
     >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
     >>>
     >>>     I think Eric Korpela would be the best person to answer that question, but I
     >>>     suspect 'probably not': further investigation over the weekend suggests that the
     >>>     circumstances may be SIMAP-specific.
     >>>
     >>>     It appears that the Android Whetstone benchmark used in the BOINC client has
     >>>     separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
     >>>     will report that it is significantly faster than a plain-vanilla ARM.
     >>>
     >>>     However, SIMAP have only deployed a single Android app, which I'm assuming only
     >>>     uses ARM functions: devices with vfp or NEON SIMD vectorisation available would
     >>>     run the non-optimised application much slower than BOINC expects.
     >>>
     >>>     At my suggestion, Thomas Rattei (SIMAP administrator) increased the
     >>>     rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
     >>>     runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
     >>>     already increased from 11 hours to 14 hours since he did that.
     >>>
     >>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
     >>>     due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
     >>>     using their phone regularly for computing." He's going to monitor how this
     >>>     progresses during the remainder of the current batch, and I've asked him to keep
     >>>     us updated on his observations.
     >>>
     >>>
     >>>
     >>>      >________________________________
     >>>      > From: David Anderson <[email protected]>
     >>>      >To: [email protected]
     >>>      >Sent: Monday, 9 June 2014, 3:48
     >>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
     >>>      >
     >>>      >
     >>>      >Does this problem occur on SETI@home?
     >>>      >-- David
     >>>      >
     >>>      >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
     >>>      >
     >>>      >> 2) Android runtime estimates
     >>>      >>
     >>>      >> The example here is from SIMAP. During a recent pause between batches, I noticed
     >>>      >> that some of my 'pending validation' tasks were being slow to clear:
     >>>      >> http://boincsimap.org/boincsimap/results.php?hostid=349248
     >>>      >>
     >>>      >> The clearest example is the third of those three workunits:
     >>>      >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
     >>>      >>
     >>>      >> Four of the seven replications have failed with 'Error while computing', and
     >>>      >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
     >>>      >>
     >>>      >> Three of the four hosts have never returned a valid result (total credit zero),
     >>>      >> so they have never had a chance to establish an APR for use in runtime
     >>>      >> estimation: runtime estimates and bounds must have been generated by the server.
     >>>      >>
     >>>      >> It seems - from these results, and others I've found pending on other machines -
     >>>      >> that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after ~6
     >>>      >> hours elapsed. For the new batch released today, SIMAP are using a 3x bound
     >>>      >> (which may be a bit low under the circumstances):
     >>>      >>
     >>>      >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
     >>>     >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
     >>>      >>
     >>>      >> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
     >>>      >>
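The deduction above can be checked with a few lines of arithmetic. This sketch assumes that the time limit is rsc_fpops_bound divided by the speed the server projected for the host, and that the runtime estimate is rsc_fpops_est divided by the same speed; the variable names are illustrative only.

```python
# Values quoted from the workunit above.
RSC_FPOPS_EST = 13_500_000_000_000.0
RSC_FPOPS_BOUND = 40_500_000_000_000.0

# Approximate elapsed time at which the Android tasks were aborted.
OBSERVED_ABORT_HOURS = 6.0

# Ratio of bound to estimate: the "3x bound".
bound_multiplier = RSC_FPOPS_BOUND / RSC_FPOPS_EST

# Speed the server must have assumed for the limit to fall at ~6 hours.
assumed_flops = RSC_FPOPS_BOUND / (OBSERVED_ABORT_HOURS * 3600)

# Initial runtime estimate implied by that assumed speed.
estimated_hours = RSC_FPOPS_EST / assumed_flops / 3600

# For comparison: the same job at the Intel i5's APR of 7.34 GFLOPS.
i5_minutes = RSC_FPOPS_EST / 7.34e9 / 60

print(bound_multiplier, assumed_flops / 1e9, estimated_hours, i5_minutes)
```

Under these assumptions the server projected roughly 1.9 GFLOPS for the Android hosts, which gives the ~2 hour estimate, while the i5 figure works out to a little over 30 minutes.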
     >>>      >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over half
     >>>      >> an hour to complete: two hours for an ARM device sounds suspiciously low. The
     >>>      >> only one of my Android wingmates to have registered an APR
     >>>      >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is showing
     >>>      >> 1.69 GFLOPS, but I have no way of knowing whether that APR was established before
     >>>      >> or after the task in question errored out.
     >>>      >>
     >>>      >> From experience - borne out by current tests at Albert@Home, where server logs
     >>>      >> are helpfully exposed to the public - initial server estimates can be hopelessly
     >>>      >> over-optimistic. These two are for the same machine:
     >>>      >>
     >>>      >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
     >>>      >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
     >>>      >>
     >>>      >> Since SIMAP have recently announced that they are leaving the BOINC platform at
     >>>      >> the end of the year (despite being an Android launch partner with Samsung), I
     >>>      >> doubt they'll want to put much effort into researching this issue.
     >>>      >>
     >>>      >> But if other projects experimenting with Android applications are experiencing a
     >>>      >> high task failure rate, they might like to check whether EXIT_TIME_LIMIT_EXCEEDED
     >>>      >> is a significant factor in those failures, and if so, consider the other
     >>>      >> remediation approaches (apart from outliers, which isn't relevant in this case)
     >>>      >> that I suggested to Eric Mcintosh at LHC.
     >
     >
     >
    _______________________________________________
    boinc_dev mailing list
    [email protected] <mailto:[email protected]>
    http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
    To unsubscribe, visit the above URL and
    (near bottom of page) enter your email address.

