Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

Richard Haselgrove Wed, 11 Jun 2014 06:55:41 -0700

Backtracking from user 'CElliot' at SETI Beta, I find user 'BlackLuke' at SETI.


Are these the tasks you're talking about?

http://setiathome.berkeley.edu/results.php?hostid=6643192&state=6


Although there is undoubtedly a problem, I can't agree with your analysis of 
how it comes about.

I've seen that
SETI@home error -6 Bad workunit header
!swi.data_type || !found || !swi.nsamples
File: ../../seti_header.cpp
Line: 210
on my own machines using the same third-party (Anonymous Platform) NV cuda 
application. It occurs (rarely) when a GPU task is resumed from 
suspension/preemption: my belief is that it's actually caused by problems in 
the checkpoint files, although the developer concerned hasn't confirmed that.

If my guess is correct, it would actually be less likely to happen during 
benchmarking: that's the one case where GPU apps are simply suspended, but not 
removed from memory and hence not restarted from a checkpoint.

If you have more evidence of the chain of events, please let us know: but I 
suspect the answer will lie in the project application code, rather than 
BOINC's benchmarking.

>________________________________
> From: Charles Elliott <[email protected]>
>To: 'Richard Haselgrove' <[email protected]> 
>Cc: [email protected] 
>Sent: Wednesday, 11 June 2014, 14:28
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
>please read)
> 
>
>With respect to the CPU benchmarks, there has recently been a huge
>improvement: Just a few months ago, one could achieve much better benchmark
>scores by clicking on snooze from the Boinc tray icon before running the
>benchmark with Boinccmd.  Now the results are the same whether one clicks on
>snooze or not.
>
>Now if only we could make the benchmarks run 3 or more times with only one
>work unit suspension and average the results using Boinccmd.  If the
>benchmarks are invoked from the command line the stderr.txt file becomes
>larger with all the duplicate header information.  stderr.txt files and
>their associated work unit results are rejected by the server if the
>stderr.txt file is deemed too large.  Average results of multiple runs of a
>benchmark are often a better indicator of actual performance than a single
>run because of the unpredictability of other tasks and O/S excursions.
>
>Charles Elliott
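
A rough illustration of the averaging point above, as a minimal Python sketch. The function name and the scores are made up for illustration (they are not measurements), and nothing here reflects how boinccmd or the client actually reports benchmark results. It only shows that one disturbed run drags a single reading far from the rest, while the median over several runs stays close to the typical value.

```python
import statistics

def summarise_runs(scores):
    """Return (mean, median, sample stdev) for a list of benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return (
        statistics.mean(scores),
        statistics.median(scores),
        statistics.stdev(scores),
    )

# Hypothetical Whetstone-style scores in MFLOPS. The third run is assumed
# to have been disturbed by other activity on the machine.
runs = [3010.0, 2995.0, 2400.0, 3005.0, 2990.0]
mean, median, spread = summarise_runs(runs)
print(f"mean={mean:.1f} median={median:.1f} stdev={spread:.1f}")
```

With an outlier like this the median is arguably the more robust summary, so a repeated benchmark might report both.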
>
>> -----Original Message-----
>> From: boinc_dev [mailto:[email protected]] On Behalf
>> Of Richard Haselgrove
>> Sent: Tuesday, June 10, 2014 5:09 AM
>> To: Richard Haselgrove; Josef W. Segur; David Anderson
>> Cc: [email protected]
>> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again,
>> but please read)
>> 
>> Before anybody leaps into making any changes on the basis of that
>> observation, I think we ought to pause and consider why we have a
>> benchmark, and what we use it for.
>> 
>> I'd suggest that in an ideal world, we would be measuring the actual
>> running speed of (each project's) science applications on that
>> particular host, optimisations and all. We gradually do this through
>> the runtime averages anyway, but it's hard to gather a priori data on a
>> new host.
>> 
>> Instead of (initially) measuring science application performance, we
>> measure hardware performance as a surrogate. We now have (at least)
>> three ways of doing that:
>> 
>> x86: minimum, most conservative, estimate, no optimisations allowed
>> for.
>> Android: allows for optimised hardware pathways with vfp or neon, but
>> doesn't relate back to science app capability.
>> GPU: maximum theoretical 'peak flops', calculated from card parameters,
>> then scaled back by rule of thumb.
>> 
>> Maybe we should standardise on just one?
>> 
>> 
>> 
>> >________________________________
>> > From: Richard Haselgrove <[email protected]>
>> >To: Josef W. Segur <[email protected]>; David Anderson
>> <[email protected]>
>> >Cc: "[email protected]" <[email protected]>
>> >Sent: Tuesday, 10 June 2014, 9:37
>> >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me
>> again, but please read)
>> >
>> >
>> >http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>> >
>> >>________________________________
>> >> From: Josef W. Segur <[email protected]>
>> >>To: David Anderson <[email protected]>
>> >>Cc: "[email protected]" <[email protected]>; Eric J
>> Korpela <[email protected]>; Richard Haselgrove
>> <[email protected]>
>> >>Sent: Tuesday, 10 June 2014, 2:19
>> >>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me
>> again, but please read)
>> >>
>> >>
>> >>Consider Richard's observation:
>> >>
>> >>>>     It appears that the Android Whetstone benchmark used in the
>> BOINC client has
>> >>>>     separate code paths for ARM, vfp, and NEON processors: a vfp
>> or NEON processor
>> >>>>     will report that it is significantly faster than a plain-
>> vanilla ARM.
>> >>
>> >>If that is so, it distinctly differs from the x86 Whetstone which
>> never uses SIMD, and is truly conservative as you would want for 3).
>> >>--
>> >>                                                                Joe
>> >>
>> >>
>> >>
>> >>On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson
>> <[email protected]> wrote:
>> >>
>> >>> Eric:
>> >>>
>> >>> Yes, I suspect that's what's going on.
>> >>> Currently the logic for estimating job runtime
>> >>> (estimate_flops() in sched_version.cpp) is
>> >>> 1) if this (host, app version) has > 10 results, use (host, app
>> version) statistics
>> >>> 2) if this app version has > 100 results, use app version
>> statistics
>> >>> 3) else use a conservative estimate based on p_fpops.
>> >>>
>> >>> I'm not sure we should be doing 2) at all,
>> >>> since as you point out the first 100 or 1000 results for an app
>> version
>> >>> will generally be from the fastest devices
>> >>> (and even in the steady state,
>> >>> app version statistics disproportionately reflect fast devices).
>> >>>
>> >>> I'll make this change.
>> >>>
>> >>> -- David
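
The three-step fallback described above can be sketched as follows. This is a minimal Python illustration of the logic as stated in the message, not the code of estimate_flops() in sched_version.cpp: the class, function and label names are hypothetical, and the "conservative estimate based on p_fpops" is represented simply by returning p_fpops unchanged. Only the thresholds (more than 10 and more than 100 results) come from the message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SpeedStats:
    n_results: int    # number of results behind the average
    avg_flops: float  # measured average speed in FLOPS

# Thresholds as quoted in the message above.
HOST_AV_MIN_RESULTS = 10
APP_VERSION_MIN_RESULTS = 100

def pick_projected_flops(
    host_av: Optional[SpeedStats],
    app_version: Optional[SpeedStats],
    p_fpops: float,
) -> Tuple[float, str]:
    """Return (projected FLOPS, name of the tier that supplied it)."""
    # 1) enough history for this (host, app version) pair
    if host_av is not None and host_av.n_results > HOST_AV_MIN_RESULTS:
        return host_av.avg_flops, "host_app_version"
    # 2) enough history for the app version across all hosts
    if app_version is not None and app_version.n_results > APP_VERSION_MIN_RESULTS:
        return app_version.avg_flops, "app_version"
    # 3) fall back to the host benchmark (placeholder for the
    #    conservative estimate; any scaling is not modelled here)
    return p_fpops, "p_fpops"

# Hypothetical numbers: a slow new host with no history of its own,
# and app version statistics dominated by fast early reporters.
new_host = pick_projected_flops(
    host_av=SpeedStats(n_results=0, avg_flops=0.0),
    app_version=SpeedStats(n_results=500, avg_flops=20e9),
    p_fpops=2e9,
)
print(new_host)
```

In this made-up case the new host is assigned the app-version average, ten times its own benchmark, which is the behaviour step 2) is being questioned for.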
>> >>>
>> >>> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>> >>>> I don't have direct access to the server either, so I'm
>> mostly guessing.
>> >>>> Having separate benchmarks for neon and VFP means there's a broad
>> bimodal
>> >>>> distribution for the benchmark results.  Where the mean falls
>> depends upon the mix
>> >>>> of machines.  In general the neon machines (being newer and
>> faster) will report
>> >>>> first and more often, so early on the PFC distribution will
>> reflect the fast
>> >>>> machines.  Slower machines will be underweighted.  So the work
>> will be estimated to
>> >>>> complete quickly, and some machines will time out.  In SETI beta,
>> it resolves itself
>> >>>> in a few weeks.  I can't guarantee that it will anywhere else.
>> >>>>
>> >>>> We see this with every release of a GPU app.  The real
>> capabilities of graphics
>> >>>> cards vary by orders of magnitude from the estimate and by more
>> from each other.
>> >>>> The fast cards report first and almost everyone else hits days of
>> timeouts.
>> >>>>
>> >>>> One possible fix is to increase the timeout limits for the first
>> 10 workunits for a
>> >>>> host_app_version, until host based estimates take over.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove
>> <[email protected]
>> >>>> <mailto:[email protected]>> wrote:
>> >>>>
>> >>>>     I think Eric Korpela would be the best person to answer that
>> question, but I
>> >>>>     suspect 'probably not': further investigation over the weekend
>> suggests that the
>> >>>>     circumstances may be SIMAP-specific.
>> >>>>
>> >>>>     It appears that the Android Whetstone benchmark used in the
>> BOINC client has
>> >>>>     separate code paths for ARM, vfp, and NEON processors: a vfp
>> or NEON processor
>> >>>>     will report that it is significantly faster than a plain-
>> vanilla ARM.
>> >>>>
>> >>>>     However, SIMAP have only deployed a single Android app, which
>> I'm assuming only
>> >>>>     uses ARM functions: devices with vfp or NEON SIMD
>> vectorisation available would
>> >>>>     run the non-optimised application much slower than BOINC
>> expects.
>> >>>>
>> >>>>     At my suggestion, Thomas Rattei (SIMAP administrator) increased
>> the
>> >>>>     rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note
>> that the maximum
>> >>>>     runtime displayed on
>> http://boincsimap.org/boincsimap/server_status.php has
>> >>>>     already increased from 11 hours to 14 hours since he did that.
>> >>>>
>> >>>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED]
>> a lot. However,
>> >>>>     due to Samsung PowerSleep, we thought these are mainly "lazy"
>> users just not
>> >>>>     using their phone regularly for computing." He's going to
>> monitor how this
>> >>>>     progresses during the remainder of the current batch, and I've
>> asked him to keep
>> >>>>     us updated on his observations.
>> >>>>
>> >>>>
>> >>>>
>> >>>>      >________________________________
>> >>>>      > From: David Anderson <[email protected]
>> <mailto:[email protected]>>
>> >>>>      >To: [email protected]
>> <mailto:[email protected]>
>> >>>>      >Sent: Monday, 9 June 2014, 3:48
>> >>>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry,
>> yes me again, but
>> >>>>     please read)
>> >>>>      >
>> >>>>      >
>> >>>>      >Does this problem occur on SETI@home?
>> >>>>      >-- David
>> >>>>      >
>> >>>>      >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>> >>>>      >
>> >>>>      >> 2) Android runtime estimates
>> >>>>      >>
>> >>>>      >> The example here is from SIMAP. During a recent pause
>> between batches, I noticed
>> >>>>      >> that some of my 'pending validation' tasks were being slow
>> to clear:
>> >>>>      >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>> >>>>      >>
>> >>>>      >> The clearest example is the third of those three
>> workunits:
>> >>>>      >>
>> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>> >>>>      >>
>> >>>>      >> Four of the seven replications have failed with 'Error
>> while computing', and
>> >>>>      >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on
>> an Android device.
>> >>>>      >>
>> >>>>      >> Three of the four hosts have never returned a valid result
>> (total credit zero),
>> >>>>      >> so they have never had a chance to establish an APR for
>> use in runtime
>> >>>>      >> estimation: runtime estimates and bounds must have been
>> generated by the server.
>> >>>>      >>
>> >>>>      >> It seems - from these results, and others I've found
>> pending on other machines -
>> >>>>      >> that SIMAP tasks on Android are aborted with
>> EXIT_TIME_LIMIT_EXCEEDED after ~6
>> >>>>      >> hours elapsed. For the new batch released today, SIMAP are
>> using a 3x bound
>> >>>>      >> (which may be a bit low under the circumstances):
>> >>>>      >>
>> >>>>      >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>> >>>>      >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>> >>>>      >>
>> >>>>      >> so I deduce that the tasks when first issued had a runtime
>> estimate of ~2 hours.
>> >>>>      >>
>> >>>>      >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34
>> GFLOPS), take over half
>> >>>>      >> an hour to complete: two hours for an ARM device sounds
>> suspiciously low. The
>> >>>>      >> only one of my Android wingmates to have registered an APR
>> >>>>      >>
>> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033)
>> is
>> >>>>     showing
>> >>>>      >> 1.69 GFLOPS, but I have no way of knowing whether that APR
>> was established
>> >>>>     before
>> >>>>      >> or after the task in question errored out.
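
The arithmetic behind the deduction above, worked through in Python. The sketch assumes that the time limit is rsc_fpops_bound divided by the projected speed and the runtime estimate is rsc_fpops_est divided by the same speed, which is how the reasoning above treats them. The helper name is hypothetical. The APR figures are the ones quoted above, and using APR as the projected speed is itself an approximation.

```python
RSC_FPOPS_EST = 13_500_000_000_000.0    # <rsc_fpops_est> quoted above
RSC_FPOPS_BOUND = 40_500_000_000_000.0  # <rsc_fpops_bound> quoted above

def hours_at(fpops: float, flops: float) -> float:
    """Hours needed to perform `fpops` operations at `flops` per second."""
    return fpops / flops / 3600.0

# The bound is three times the estimate.
bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST

# Tasks were aborted after roughly 6 hours, so the estimate was about 2 hours
# and the speed the server assumed for them was a little under 2 GFLOPS.
observed_limit_hours = 6.0
implied_estimate_hours = observed_limit_hours / bound_ratio
implied_flops = RSC_FPOPS_EST / (implied_estimate_hours * 3600.0)

print(f"bound / estimate        : {bound_ratio:.1f}x")
print(f"implied estimate        : {implied_estimate_hours:.1f} h")
print(f"implied projected speed : {implied_flops / 1e9:.3f} GFLOPS")

# Cross-check against the two APR values quoted above.
print(f"estimate at 7.34 GFLOPS : {hours_at(RSC_FPOPS_EST, 7.34e9) * 60:.1f} min")
print(f"estimate at 1.69 GFLOPS : {hours_at(RSC_FPOPS_EST, 1.69e9):.2f} h")
print(f"limit at 1.69 GFLOPS    : {hours_at(RSC_FPOPS_BOUND, 1.69e9):.2f} h")
```

The 7.34 GFLOPS figure gives an estimate of a little over 30 minutes, consistent with the half-hour runtimes reported for the i5. The 1.69 GFLOPS Android figure gives an estimate and limit close to the ~2 hours and ~6 hours deduced above.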
>> >>>>      >>
>> >>>>      >> From experience - borne out by current tests at
>> Albert@Home, where server logs
>> >>>>      >> are helpfully exposed to the public - initial server
>> estimates can be hopelessly
>> >>>>      >> over-optimistic. These two are for the same machine:
>> >>>>      >>
>> >>>>      >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>> >>>>      >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
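
To put a number on "hopelessly over-optimistic", a short Python sketch comparing the two projected speeds from the log lines above. It assumes that runtime scales inversely with projected speed and that the time limit is the runtime estimate multiplied by the rsc_fpops_bound / rsc_fpops_est ratio. The function name is hypothetical, and the 3x and 10x multipliers are the ones mentioned for SIMAP elsewhere in this thread, applied here purely for comparison.

```python
PFC_AVG_FLOPS = 2124.60e9  # "PFC avg" figure from the first log line
HOST_AVG_FLOPS = 23.71e9   # "host elapsed time avg" from the second log line

# If the estimate is built on the first figure but the host really runs
# at the second, the task takes this many times longer than estimated.
overestimate = PFC_AVG_FLOPS / HOST_AVG_FLOPS

def exceeds_bound(overestimate_factor: float, bound_multiplier: float) -> bool:
    """True if the real runtime would pass the time limit."""
    return overestimate_factor > bound_multiplier

print(f"initial estimate is optimistic by about {overestimate:.1f}x")
for multiplier in (3.0, 10.0):
    print(f"bound {multiplier:>4.0f}x exceeded: {exceeds_bound(overestimate, multiplier)}")
```

Under these assumptions neither a 3x nor a 10x bound would be enough headroom for a host like this one until its own elapsed-time average takes over.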
>> >>>>      >>
>> >>>>      >> Since SIMAP have recently announced that they are leaving
>> the BOINC platform at
>> >>>>      >> the end of the year (despite being an Android launch
>> partner with Samsung), I
>> >>>>      >> doubt they'll want to put much effort into researching
>> this issue.
>> >>>>      >>
>> >>>>      >> But if other projects experimenting with Android
>> applications are experiencing a
>> >>>>      >> high task failure rate, they might like to check whether
>> >>>>     EXIT_TIME_LIMIT_EXCEEDED
>> >>>>      >> is a significant factor in those failures, and if so,
>> consider the other
>> >>>>      >> remediation approaches (apart from outliers, which isn't
>> relevant in this case)
>> >>>>      >> that I suggested to Eric Mcintosh at LHC.
>> >>
>> >>
>> >>
>> >_______________________________________________
>> >boinc_dev mailing list
>> >[email protected]
>> >http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> >To unsubscribe, visit the above URL and
>> >(near bottom of page) enter your email address.
>> >
>> >
>> >
>
>
>
>
