I haven't thought about it in a while. I had come up with a stable system that would work, but it wasn't simple and it also required projects to voluntarily participate. Therefore it wouldn't have worked.

The only thought I've had recently is to have a "calibration" plan class with a non-SIMD, non-threaded, unoptimized, CPU-only app_version that gets sent out once in every N (~100,000) results. This (as the least efficient app_version) could set the pfc_scale. Again, it would require project participation, so it wouldn't work. So I spend most of my time trying not to think about it.
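(A minimal sketch of what that sampling decision might look like, purely for illustration: the constant, the function, and the result counter are hypothetical names, and the actual pfc_scale normalization would still be done by the existing CreditNew machinery, with the calibration version's PFC statistics serving as the reference.)

    // Hypothetical sketch only -- none of these identifiers exist in the
    // BOINC scheduler. The idea: roughly 1 in every N results is issued
    // with the unoptimized "calibration" app_version, and its PFC
    // statistics then anchor the pfc_scale of the other app_versions.

    const long CALIBRATION_INTERVAL = 100000;   // ~1 in 100,000 results

    // Decide whether this result should be sent with the calibration
    // (non-SIMD, non-threaded, CPU-only) app_version.
    bool use_calibration_version(long result_seqno) {
        return (result_seqno % CALIBRATION_INTERVAL) == 0;
    }

The hard part, as noted above, is that it only works if projects actually build and maintain such a version.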
On Tue, Jun 10, 2014 at 12:12 PM, David Anderson <[email protected]> wrote:

> Are you saying we're taking the wrong approach?
> Any other suggestions?
>
> On 10-Jun-2014 11:51 AM, Eric J Korpela wrote:
>
>> > For credit purposes, the standard is peak FLOPS,
>> > i.e. we give credit for what the device could do,
>> > rather than what it actually did.
>> > Among other things, this encourages projects to develop more efficient apps.
>>
>> It does the opposite, because many projects care more about attracting
>> volunteers than they do about efficient computation.
>>
>> First: per second of run time, a host gets the same credit for a non-optimized
>> stock app as it does for an optimized stock app. There's no benefit to the
>> volunteer in going to a project with optimized apps. In fact there's a benefit
>> for users to compile an optimized app for use at a non-optimized project, where
>> their credit will be higher. Every time we optimize SETI@home we get bombarded
>> by users of non-stock optimized apps who are angry because their RAC goes down.
>> That makes it a disincentive to optimize.
>>
>> Second: this method encourages projects to create separate apps for GPUs rather
>> than separate app_versions. Because GPUs obtain nowhere near their advertised
>> rates for real code, a separate GPU app can earn 20 to 100x the credit of a GPU
>> app_version of an app that also has CPU app_versions.
>>
>> Third: it encourages projects not to use the BOINC credit granting mechanisms.
>> To compete with projects that have GPU-only apps, some projects grant
>> outrageous credit for everything.
>>
>> On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
>>
>> For credit purposes, the standard is peak FLOPS,
>> i.e. we give credit for what the device could do,
>> rather than what it actually did.
>> Among other things, this encourages projects to develop more efficient apps.
>>
>> Currently we're not measuring this well for x86 CPUs,
>> since our Whetstone benchmark isn't optimized.
>> Ideally the BOINC client should include variants for the most common
>> CPU features, as we do for ARM.
>>
>> -- D
>>
>> On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
>>
>> Before anybody leaps into making any changes on the basis of that observation,
>> I think we ought to pause and consider why we have a benchmark, and what we
>> use it for.
>>
>> I'd suggest that in an ideal world, we would be measuring the actual running
>> speed of (each project's) science applications on that particular host,
>> optimisations and all. We gradually do this through the runtime averages
>> anyway, but it's hard to gather a priori data on a new host.
>>
>> Instead of (initially) measuring science application performance, we measure
>> hardware performance as a surrogate. We now have (at least) three ways of
>> doing that:
>>
>> x86: minimum, most conservative estimate; no optimisations allowed for.
>> Android: allows for optimised hardware pathways with vfp or NEON, but
>> doesn't relate back to science app capability.
>> GPU: maximum theoretical 'peak flops', calculated from card parameters, then
>> scaled back by rule of thumb.
>>
>> Maybe we should standardise on just one?
>>
>> ________________________________________________________________________
>> *From:* Richard Haselgrove <[email protected]>
>> *To:* Josef W. Segur <[email protected]>; David Anderson <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>
>> *Sent:* Tuesday, 10 June 2014, 9:37
>> *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>
>> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>>
>> ________________________________
>> > From: Josef W. Segur <[email protected]>
>> > To: David Anderson <[email protected]>
>> > Cc: "[email protected]" <[email protected]>; Eric J Korpela
>> > <[email protected]>; Richard Haselgrove <[email protected]>
>> > Sent: Tuesday, 10 June 2014, 2:19
>> > Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>> >
>> > Consider Richard's observation:
>> >
>> >>> It appears that the Android Whetstone benchmark used in the BOINC client has
>> >>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>> >>> will report that it is significantly faster than a plain-vanilla ARM.
>> >
>> > If that is so, it distinctly differs from the x86 Whetstone, which never uses
>> > SIMD and is truly conservative, as you would want for 3).
>> > --
>> > Joe
>> >
>> > On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
>> >
>> >> Eric:
>> >>
>> >> Yes, I suspect that's what's going on.
>> >> Currently the logic for estimating job runtime
>> >> (estimate_flops() in sched_version.cpp) is
>> >> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
>> >> 2) if this app version has > 100 results, use app version statistics
>> >> 3) else use a conservative estimate based on p_fpops.
>> >>
>> >> I'm not sure we should be doing 2) at all,
>> >> since as you point out the first 100 or 1000 results for an app version
>> >> will generally be from the fastest devices
>> >> (and even in the steady state,
>> >> app version statistics disproportionately reflect fast devices).
>> >>
>> >> I'll make this change.
>> >>
>> >> -- David
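(To make those three cases concrete, here is a compilable paraphrase of the selection order; the struct fields and thresholds are illustrative stand-ins, not the actual identifiers in sched_version.cpp, and case 2 is the one proposed for removal above.)

    // Illustrative stand-ins for the scheduler's records; the field names
    // are hypothetical, not the real ones in sched_version.cpp.
    struct HOST             { double p_fpops; };    // benchmarked FLOPS
    struct APP_VERSION      { int pfc_n; double projected_flops; };
    struct HOST_APP_VERSION {
        int et_n;        // completed results for this (host, app version)
        double et_avg;   // average elapsed seconds per estimated FLOP
    };

    const int HAV_MIN_RESULTS = 10;
    const int AV_MIN_RESULTS  = 100;

    // Paraphrase of the three-tier logic described above.
    double estimate_flops_sketch(const HOST_APP_VERSION& hav,
                                 const APP_VERSION& av,
                                 const HOST& host) {
        if (hav.et_n > HAV_MIN_RESULTS) {
            return 1.0 / hav.et_avg;     // 1) per-(host, app version) statistics
        }
        if (av.pfc_n > AV_MIN_RESULTS) {
            return av.projected_flops;   // 2) app-version statistics (biased toward fast hosts)
        }
        return host.p_fpops;             // 3) conservative estimate from the host benchmark
    }

Dropping case 2 would mean a new host stays on the conservative benchmark-based estimate until it has accumulated its own history.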
>> >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>> >>> I also don't have direct access to the server, so I'm mostly guessing.
>> >>> Having separate benchmarks for NEON and VFP means there's a broad bimodal
>> >>> distribution for the benchmark results. Where the mean falls depends upon the
>> >>> mix of machines. In general the NEON machines (being newer and faster) will
>> >>> report first and more often, so early on the PFC distribution will reflect the
>> >>> fast machines. Slower machines will be underweighted. So the work will be
>> >>> estimated to complete quickly, and some machines will time out. In SETI beta,
>> >>> it resolves itself in a few weeks. I can't guarantee that it will anywhere else.
>> >>>
>> >>> We see this with every release of a GPU app. The real capabilities of graphics
>> >>> cards vary by orders of magnitude from the estimate and by more from each
>> >>> other. The fast cards report first, and most everyone else hits days of timeouts.
>> >>>
>> >>> One possible fix is to increase the timeout limits for the first 10 workunits
>> >>> for a host_app_version, until host-based estimates take over.
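(A sketch of what that mitigation could look like on the server side, purely illustrative: the names, the threshold of 10, and the extra factor are assumptions, not anything in the scheduler.)

    // Hypothetical sketch of the suggestion above: give extra slack on
    // rsc_fpops_bound while a host_app_version has too little history
    // for host-based runtime estimates. All names and numbers here are
    // illustrative, not actual BOINC scheduler code.

    const int    HAV_RELIABLE_RESULTS = 10;   // results before host estimates take over
    const double EARLY_BOUND_FACTOR   = 10.0; // assumed extra slack for early results

    double effective_fpops_bound(double rsc_fpops_bound, int hav_completed_results) {
        if (hav_completed_results < HAV_RELIABLE_RESULTS) {
            return rsc_fpops_bound * EARLY_BOUND_FACTOR;
        }
        return rsc_fpops_bound;
    }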
>> >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
>> >>>
>> >>> I think Eric Korpela would be the best person to answer that question, but I
>> >>> suspect 'probably not': further investigation over the weekend suggests that
>> >>> the circumstances may be SIMAP-specific.
>> >>>
>> >>> It appears that the Android Whetstone benchmark used in the BOINC client has
>> >>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>> >>> will report that it is significantly faster than a plain-vanilla ARM.
>> >>>
>> >>> However, SIMAP have only deployed a single Android app, which I'm assuming only
>> >>> uses ARM functions: devices with vfp or NEON SIMD vectorisation available would
>> >>> run the non-optimised application much more slowly than BOINC expects.
>> >>>
>> >>> At my suggestion, Thomas Rattei (SIMAP administrator) increased the
>> >>> rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
>> >>> runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
>> >>> already increased from 11 hours to 14 hours since he did that.
>> >>>
>> >>> Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
>> >>> due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
>> >>> using their phone regularly for computing." He's going to monitor how this
>> >>> progresses during the remainder of the current batch, and I've asked him to
>> >>> keep us updated on his observations.
>> >>>
>> >>> ________________________________
>> >>> > From: David Anderson <[email protected]>
>> >>> > To: [email protected]
>> >>> > Sent: Monday, 9 June 2014, 3:48
>> >>> > Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>> >>> >
>> >>> > Does this problem occur on SETI@home?
>> >>> > -- David
>> >>> >
>> >>> > On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>> >>> >
>> >>> >> 2) Android runtime estimates
>> >>> >>
>> >>> >> The example here is from SIMAP. During a recent pause between batches, I
>> >>> >> noticed that some of my 'pending validation' tasks were being slow to clear:
>> >>> >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>> >>> >>
>> >>> >> The clearest example is the third of those three workunits:
>> >>> >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>> >>> >>
>> >>> >> Four of the seven replications have failed with 'Error while computing', and
>> >>> >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>> >>> >>
>> >>> >> Three of the four hosts have never returned a valid result (total credit
>> >>> >> zero), so they have never had a chance to establish an APR for use in runtime
>> >>> >> estimation: runtime estimates and bounds must have been generated by the server.
>> >>> >>
>> >>> >> It seems - from these results, and others I've found pending on other
>> >>> >> machines - that SIMAP tasks on Android are aborted with
>> >>> >> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch released
>> >>> >> today, SIMAP are using a 3x bound (which may be a bit low under the
>> >>> >> circumstances):
>> >>> >>
>> >>> >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>> >>> >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>> >>> >>
>> >>> >> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
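(For reference, the arithmetic behind that deduction, assuming the usual relationship that the runtime estimate is roughly rsc_fpops_est divided by the projected FLOPS assigned to the app version:)

    implied projected flops  ~ rsc_fpops_bound / abort time
                             ~ 4.05e13 / (6 h * 3600 s) ~ 1.9 GFLOPS
    initial runtime estimate ~ rsc_fpops_est / projected flops
                             ~ 1.35e13 / 1.9e9 ~ 7,100 s ~ 2 hours

(That is consistent with the 3x bound, 3 x 2 h giving the ~6 h at which the Android tasks were killed, and with the i5 'Haswell' APR of 7.34 GFLOPS mentioned just below: 1.35e13 / 7.34e9 ~ 1,840 s, a bit over half an hour.)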
>> >>> >>
>> >>> >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over
>> >>> >> half an hour to complete: two hours for an ARM device sounds suspiciously
>> >>> >> low. The only one of my Android wingmates to have registered an APR
>> >>> >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is
>> >>> >> showing 1.69 GFLOPS, but I have no way of knowing whether that APR was
>> >>> >> established before or after the task in question errored out.
>> >>> >>
>> >>> >> From experience - borne out by current tests at Albert@Home, where server
>> >>> >> logs are helpfully exposed to the public - initial server estimates can be
>> >>> >> hopelessly over-optimistic. These two are for the same machine:
>> >>> >>
>> >>> >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301)
>> >>> >> adjusting projected flops based on PFC avg: 2124.60G
>> >>> >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301)
>> >>> >> setting projected flops based on host elapsed time avg: 23.71G
>> >>> >>
>> >>> >> Since SIMAP have recently announced that they are leaving the BOINC platform
>> >>> >> at the end of the year (despite being an Android launch partner with
>> >>> >> Samsung), I doubt they'll want to put much effort into researching this issue.
>> >>> >>
>> >>> >> But if other projects experimenting with Android applications are
>> >>> >> experiencing a high task failure rate, they might like to check whether
>> >>> >> EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if
>> >>> >> so, consider the other remediation approaches (apart from outliers, which
>> >>> >> isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
