Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

McLeod, John Wed, 11 Jun 2014 09:14:38 -0700

A possibility that will work itself out eventually:

If a machine has exceeded its computation bound on every task so far for a 
given application version / plan class, then its next task for that 
application version / plan class gets its computation bound multiplied by 2, 
and so on until it actually completes one.  That completed task would then be 
used as the baseline average computation length for that application version / 
plan class.
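
To make the proposal concrete, here is a minimal sketch. It assumes the 
doubling compounds with each consecutive timeout; the HostAppVersion record 
and both function names are hypothetical and are not BOINC code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostAppVersion:
    """Hypothetical per-(host, app version / plan class) record."""
    completed: int = 0                      # tasks finished within their bound
    timeouts: int = 0                       # tasks that exceeded their bound
    baseline_fpops: Optional[float] = None  # taken from the first success


def bound_for_next_task(rsc_fpops_bound: float, hav: HostAppVersion) -> float:
    """Double the bound once per timeout until the host completes a task."""
    if hav.completed > 0:
        # A real measurement exists, so the project's own bound applies again.
        return rsc_fpops_bound
    return rsc_fpops_bound * (2 ** hav.timeouts)


def record_result(hav: HostAppVersion, succeeded: bool, fpops_used: float) -> None:
    """Update the record after a task finishes or times out."""
    if succeeded:
        hav.completed += 1
        if hav.baseline_fpops is None:
            # The first completed task becomes the baseline computation length.
            hav.baseline_fpops = fpops_used
    else:
        hav.timeouts += 1
```

A host that has timed out twice would get four times the project's bound on 
its third task; once a task completes, the multiplier no longer applies.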


-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of Josef 
W. Segur
Sent: Wednesday, June 11, 2014 11:38 AM
To: David Anderson
Cc: [email protected]; Richard Haselgrove
Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
please read)

A couple of ideas specifically related to EXIT_TIME_LIMIT_EXCEEDED:

1. The BOINC x86 benchmark based on the Whetstone test has typically yielded a 
value which is roughly 90% of the CPU clock rate. In terms of a conservative 
estimate to be used prior to having usable statistics about how the CPU 
performs on a specific app version, it may be sensible to use clock rate rather 
than p_fpops. That would not take into account the possibility that hardware or 
OS may change the clock rate, but neither does the benchmark.

2. When the scheduler assigns a task to a host, it could multiply the 
rsc_fpops_bound by 5 or 10 if the host does not yet have sufficient results for 
the app version.
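
The two ideas above can be sketched as follows. This is an illustration under 
stated assumptions, not scheduler code: the function names are hypothetical, 
the "> 10 results" threshold is the one quoted later in this thread, and the 
elapsed-time limit is assumed to be rsc_fpops_bound divided by the projected 
FLOPS.

```python
MIN_HOST_RESULTS = 10   # "> 10 results" threshold quoted later in this thread


def initial_flops_estimate(clock_hz: float) -> float:
    """Idea 1: use the CPU clock rate (Hz) in place of p_fpops."""
    return clock_hz


def widened_bound(rsc_fpops_bound: float, host_results: int,
                  factor: float = 10.0) -> float:
    """Idea 2: multiply the bound while the host lacks enough results."""
    if host_results > MIN_HOST_RESULTS:
        return rsc_fpops_bound
    return rsc_fpops_bound * factor


def time_limit_hours(rsc_fpops_bound: float, projected_flops: float) -> float:
    """Assumed relation: elapsed-time limit = bound / projected FLOPS."""
    return rsc_fpops_bound / projected_flops / 3600.0


if __name__ == "__main__":
    # SIMAP figures quoted later in this thread: a 3x bound and an initial
    # runtime estimate of about 2 hours.
    est, bound = 1.35e13, 4.05e13
    projected = est / (2 * 3600.0)          # FLOPS implied by a 2-hour estimate
    print(time_limit_hours(bound, projected))
    print(time_limit_hours(widened_bound(bound, 0), projected))
```

With those figures the limit works out to 6 hours as issued, and 60 hours if 
the bound is multiplied by 10 for a host with no results yet.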
-- 
                                                              Joe



On Tue, 10 Jun 2014 14:34:57 -0400, David Anderson <[email protected]> 
wrote:

> For credit purposes, the standard is peak FLOPS,
> i.e. we give credit for what the device could do,
> rather than what it actually did.
> Among other things, this encourages projects to develop more efficient apps.
>
> Currently we're not measuring this well for x86 CPUs,
> since our Whetstone benchmark isn't optimized.
> Ideally the BOINC client should include variants for the most common
> CPU features, as we do for ARM.
>
> -- D
>
> On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
>> Before anybody leaps into making any changes on the basis of that 
>> observation, I
>> think we ought to pause and consider why we have a benchmark, and what we 
>> use it for.
>>
>> I'd suggest that in an ideal world, we would be measuring the actual running 
>> speed
>> of (each project's) science applications on that particular host, 
>> optimisations and
>> all. We gradually do this through the runtime averages anyway, but it's hard 
>> to
>> gather a priori data on a new host.
>>
>> Instead of (initially) measuring science application performance, we measure
>> hardware performance as a surrogate. We now have (at least) three ways of 
>> doing that:
>>
>> x86: minimum, most conservative, estimate, no optimisations allowed for.
>> Android: allows for optimised hardware pathways with vfp or neon, but 
>> doesn't relate
>> back to science app capability.
>> GPU: maximum theoretical 'peak flops', calculated from card parameters, then 
>> scaled
>> back by rule of thumb.
>>
>> Maybe we should standardise on just one standard?
>>
>>     
>> ------------------------------------------------------------------------------------
>>     *From:* Richard Haselgrove <[email protected]>
>>     *To:* Josef W. Segur <[email protected]>; David Anderson
>>     <[email protected]>
>>     *Cc:* "[email protected]" <[email protected]>
>>     *Sent:* Tuesday, 10 June 2014, 9:37
>>     *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me 
>> again, but
>>     please read)
>>
>>     
>> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>>
>>      >________________________________
>>      > From: Josef W. Segur <[email protected] 
>> <mailto:[email protected]>>
>>      >To: David Anderson <[email protected] 
>> <mailto:[email protected]>>
>>      >Cc: "[email protected] <mailto:[email protected]>"
>>     <[email protected] <mailto:[email protected]>>; Eric J 
>> Korpela
>>     <[email protected] <mailto:[email protected]>>; Richard 
>> Haselgrove
>>     <[email protected] <mailto:[email protected]>>
>>      >Sent: Tuesday, 10 June 2014, 2:19
>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me 
>> again, but
>>     please read)
>>      >
>>      >
>>      >Consider Richard's observation:
>>      >
>>      >>>     It appears that the Android Whetstone benchmark used in the 
>> BOINC
>>     client has
>>      >>>     separate code paths for ARM, vfp, and NEON processors: a vfp or 
>> NEON
>>     processor
>>      >>>     will report that it is significantly faster than a 
>> plain-vanilla ARM.
>>      >
>>      >If that is so, it distinctly differs from the x86 Whetstone which 
>> never uses
>>     SIMD, and is truly conservative as you would want for 3).
>>      >--
>>      >                               Joe
>>      >
>>      >
>>      >
>>      >On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson 
>> <[email protected]
>>     <mailto:[email protected]>> wrote:
>>      >
>>      >> Eric:
>>      >>
>>      >> Yes, I suspect that's what's going on.
>>      >> Currently the logic for estimating job runtime
>>      >> (estimate_flops() in sched_version.cpp) is
>>      >> 1) if this (host, app version) has > 10 results, use (host, app 
>> version)
>>     statistics
>>      >> 2) if this app version has > 100 results, use app version statistics
>>      >> 3) else use a conservative estimate based on p_fpops.
>>      >>
>>      >> I'm not sure we should be doing 2) at all,
>>      >> since as you point out the first 100 or 1000 results for an app 
>> version
>>      >> will generally be from the fastest devices
>>      >> (and even in the steady state,
>>      >> app version statistics disproportionately reflect fast devices).
>>      >>
>>      >> I'll make this change.
>>      >>
>>      >> -- David
>>      >>
>>      >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>>      >>> I don't have direct access to the server either, so I'm 
>> mostly guessing.
>>      >>> Having separate benchmarks for neon and VFP means there's a broad 
>> bimodal
>>      >>> distribution for the benchmark results.  Where the mean falls 
>> depends upon
>>     the mix
>>      >>> of machines.  In general the neon machines (being newer and faster) 
>> will report
>>      >>> first and more often, so early on the PFC distribution will reflect 
>> the fast
>>      >>> machines.  Slower machines will be underweighted.  So the work will 
>> be
>>     estimated to
>>      >>> complete quickly, and some machines will time out.  In SETI beta, it
>>     resolves itself
>>      >>> in a few weeks.  I can't guarantee that it will anywhere else.
>>      >>>
>>      >>> We see this with every release of a GPU app.  The real capabilities 
>> of graphics
>>      >>> cards vary by orders of magnitude from the estimate and by more 
>> from each
>>     other.
>>      >>> The fast cards report first and most everyone else hits days of 
>> timeouts.
>>      >>>
>>      >>> One possible fix is to increase the timeout limits for the first 10
>>     workunits for a
>>      >>> host_app_version, until host based estimates take over.
>>      >>>
>>      >>>
>>      >>>
>>      >>>
>>      >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove
>>     <[email protected] <mailto:[email protected]>
>>      >>> <mailto:[email protected]
>>     <mailto:[email protected]>>> wrote:
>>      >>>
>>      >>>     I think Eric Korpela would be the best person to answer that 
>> question,
>>     but I
>>      >>>     suspect 'probably not': further investigation over the weekend 
>> suggests
>>     that the
>>      >>>     circumstances may be SIMAP-specific.
>>      >>>
>>      >>>     It appears that the Android Whetstone benchmark used in the 
>> BOINC
>>     client has
>>      >>>     separate code paths for ARM, vfp, and NEON processors: a vfp or 
>> NEON
>>     processor
>>      >>>     will report that it is significantly faster than a 
>> plain-vanilla ARM.
>>      >>>
>>      >>>     However, SIMAP have only deployed a single Android app, which 
>> I'm
>>     assuming only
>>      >>>     uses ARM functions: devices with vfp or NEON SIMD vectorisation
>>     available would
>>      >>>     run the non-optimised application much slower than BOINC 
>> expects.
>>      >>>
>>      >>>     At my suggestion, Thomas Rattei (SIMAP administrator) increased 
>> the
>>      >>>     rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note 
>> that the
>>     maximum
>>      >>>     runtime displayed on 
>> http://boincsimap.org/boincsimap/server_status.php has
>>      >>>     already increased from 11 hours to 14 hours since he did that.
>>      >>>
>>      >>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] 
>> a lot.
>>     However,
>>      >>>     due to Samsung PowerSleep, we thought these are mainly "lazy" 
>> users
>>     just not
>>      >>>     using their phone regularly for computing." He's going to 
>> monitor how this
>>      >>>     progresses during the remainder of the current batch, and I've 
>> asked
>>     him to keep
>>      >>>     us updated on his observations.
>>      >>>
>>      >>>
>>      >>>
>>      >>>      >________________________________
>>      >>>      > From: David Anderson <[email protected]
>>     <mailto:[email protected]> <mailto:[email protected]
>>     <mailto:[email protected]>>>
>>      >>>      >To: [email protected] 
>> <mailto:[email protected]>
>>     <mailto:[email protected] <mailto:[email protected]>>
>>      >>>     >Sent: Monday, 9 June 2014, 3:48
>>      >>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes 
>> me
>>     again, but
>>      >>>     please read)
>>      >>>      >
>>      >>>      >
>>      >>>      >Does this problem occur on SETI@home?
>>      >>>      >-- David
>>      >>>      >
>>      >>>      >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>>      >>>      >
>>      >>>      >> 2) Android runtime estimates
>>      >>>      >>
>>      >>>      >> The example here is from SIMAP. During a recent pause 
>> between
>>     batches, I noticed
>>      >>>      >> that some of my 'pending validation' tasks were being slow 
>> to clear:
>>      >>>      >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>>      >>>      >>
>>      >>>      >> The clearest example is the third of those three workunits:
>>      >>>      >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>      >>>      >>
>>      >>>      >> Four of the seven replications have failed with 'Error while
>>     computing', and
>>      >>>      >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an
>>     Android device.
>>      >>>      >>
>>      >>>      >> Three of the four hosts have never returned a valid result 
>> (total
>>     credit zero),
>>      >>>      >> so they have never had a chance to establish an APR for use 
>> in runtime
>>      >>>      >> estimation: runtime estimates and bounds must have been 
>> generated
>>     by the server.
>>      >>>      >>
>>      >>>      >> It seems - from these results, and others I've found 
>> pending on
>>     other machines -
>>      >>>      >> that SIMAP tasks on Android are aborted with
>>     EXIT_TIME_LIMIT_EXCEEDED after ~6
>>      >>>      >> hours elapsed. For the new batch released today, SIMAP are 
>> using a
>>     3x bound
>>      >>>      >> (which may be a bit low under the circumstances):
>>      >>>      >>
>>      >>>      >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>      >>>     >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>      >>>      >>
>>      >>>      >> so I deduce that the tasks when first issued had a runtime 
>> estimate
>>     of ~2 hours.
>>      >>>      >>
>>      >>>      >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 
>> GFLOPS),
>>     take over half
>>      >>>      >> an hour to complete: two hours for an ARM device sounds
>>     suspiciously low. The
>>      >>>      >> only one of my Android wingmates to have registered an APR
>>      >>>      >>
>>     (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is
>>      >>>     showing
>>      >>>      >> 1.69 GFLOPS, but I have no way of knowing whether that APR 
>> was
>>     established
>>      >>>     before
>>      >>>      >> or after the task in question errored out.
>>      >>>      >>
>>      >>>      >> From experience - borne out by current tests at 
>> Albert@Home, where
>>     server logs
>>      >>>      >> are helpfully exposed to the public - initial server 
>> estimates can
>>     be hopelessly
>>      >>>      >> over-optimistic. These two are for the same machine:
>>      >>>      >>
>>      >>>      >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>>      >>>      >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
>>      >>>      >>
>>      >>>      >> Since SIMAP have recently announced that they are leaving 
>> the BOINC
>>     platform at
>>      >>>      >> the end of the year (despite being an Android launch 
>> partner with
>>     Samsung), I
>>      >>>      >> doubt they'll want to put much effort into researching this 
>> issue.
>>      >>>      >>
>>      >>>      >> But if other projects experimenting with Android 
>> applications are
>>     experiencing a
>>      >>>      >> high task failure rate, they might like to check whether
>>      >>>     EXIT_TIME_LIMIT_EXCEEDED
>>      >>>      >> is a significant factor in those failures, and if so, 
>> consider the
>>     other
>>      >>>      >> remediation approaches (apart from outliers, which isn't 
>> relevant
>>     in this case)
>>      >>>      >> that I suggested to Eric Mcintosh at LHC.
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.