Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

Charles Elliott Wed, 11 Jun 2014 06:57:33 -0700

Wasn't the fundamental problem being attacked the constant credit inflation due 
to architectural improvements in CPUs and GPUs?  It is like monetary inflation: 
the value of credits "in the bank" (i.e., in the database) falls because of 
factors people cannot control.  I don't know of any way of countering this 
except by reducing the credit allocated per FLOP.


Charles Elliott

> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf
> Of Raistmer the Sorcerer
> Sent: Tuesday, June 10, 2014 3:52 PM
> To: David Anderson
> Cc: [email protected]; Richard Haselgrove; Josef W. Segur
> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again,
> but please read)
> 
>  The current approach to credit accounting is definitely wrong. Whole SETI
> forums have been discussing how wrong it is for many months already. It's
> almost impossible to avoid this topic if one ever goes there.
> 
> Some suggestions could be:
> To recall why those credits are needed for BOINC at all. The correct answer
> is to ATTRACT participants by exploiting HUMAN competitive nature. Not to
> measure anything; it's social engineering first of all!
> 
> From this approach some conclusions can be drawn.
> It's in human nature to get angry at being "less paid". Hence - NEVER
> deflate credits! Inflation - no problem, people like to get more, but
> NEVER decrease the amount granted, for any reason.
> 
> And that's exactly what we get with the current system.
> We work hard to optimize the SETI code. Then we release an app. All users
> who installed it are happy - it works faster, they get MORE credits
> with the SAME hardware.
> Then, being interested in the project, we try to incorporate the
> optimizations we found into the project's stock app. Finally the new stock
> app is released...
> And the whole mess begins. Users of the stock app notice nothing - their
> credit remains the same. But THE MOST active users, those who go to the
> trouble of installing opt apps, switching to anonymous platform and so on
> (let's say the biggest project fans) instantly get pissed off. Their RAC
> starts to drop! WHY?! Because some "idiots" decided to improve the stock
> app??! And flame wars on the forums begin.
> All this is absolutely not about how scientifically correctly you guys
> account for the FLOPS being done, it's about keeping the PARTICIPANTS who
> donate resources HAPPY. And the current CreditScrew gives the diametrically
> opposite feeling to both participants AND developers.
> One would say quite an impressive outcome...
> 
> What could be suggested for further discussion: try to calibrate not on
> the stock app but on the fastest correctly computing app in the project
> (usually this will mean an anonymous platform app, btw).
> Then, even if some credits are decreased (though additional
> consideration should be given to avoiding ANY drop in RAC because of any
> software replacement in stock), they would be decreased for stock users.
> This would
> 1) stimulate users to install the fastest app.
> 2) stimulate the project to incorporate the fastest algorithms in its stock
> app.
> 
> 
> 
> Tue, 10 Jun 2014 12:12:24 -0700 from David Anderson
> <[email protected]>:
> >Are you saying we're taking the wrong approach?
> >Any other suggestions?
> >
> >On 10-Jun-2014 11:51 AM, Eric J Korpela wrote:
> >>  >For credit purposes, the standard is peak FLOPS,
> >>  >i.e. we give credit for what the device could do,
> >>  >rather than what it actually did.
> >>  >Among other things, this encourages projects to develop more efficient apps.
> >>
> >> It does the opposite because many projects care more about attracting volunteers
> >> than they do about efficient computation.
> >>
> >> First: Per second of run time, a host gets the same credit for a non-optimized
> >> stock app as it does for an optimized stock app.  There's no benefit to the
> >> volunteer to go to a project with optimized apps.  In fact there's a benefit for
> >> users to compile an optimized app for use at a non-optimized project where their
> >> credit will be higher.  Every time we optimize SETI@home we get bombarded by users
> >> of non-stock optimized apps who get angry because their RAC goes down.  That makes
> >> it a disincentive to optimize.
> >>
> >> Second:  This method encourages projects to create separate apps for GPUs rather
> >> than separate app_versions.  Because GPUs obtain nowhere near their advertised rates
> >> for real code, a separate GPU app can earn 20 to 100x the credit of a GPU
> >> app_version of an app that also has CPU app_versions.
> >>
> >> Third: It encourages projects to not use the BOINC credit granting mechanisms.  To
> >> compete with projects that have GPU only apps, some projects grant outrageous credit
> >> for everything.
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
> >>
> >>     For credit purposes, the standard is peak FLOPS,
> >>     i.e. we give credit for what the device could do,
> >>     rather than what it actually did.
> >>     Among other things, this encourages projects to develop more efficient apps.
> >>
> >>     Currently we're not measuring this well for x86 CPUs,
> >>     since our Whetstone benchmark isn't optimized.
> >>     Ideally the BOINC client should include variants for the most common
> >>     CPU features, as we do for ARM.
> >>
> >>     -- D
> >>
> >>
> >>     On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
> >>
> >>         Before anybody leaps into making any changes on the basis of that observation, I
> >>         think we ought to pause and consider why we have a benchmark, and what we
> >>         use it for.
> >>
> >>         I'd suggest that in an ideal world, we would be measuring the actual running
> >>         speed of (each project's) science applications on that particular host,
> >>         optimisations and all. We gradually do this through the runtime averages
> >>         anyway, but it's hard to gather a priori data on a new host.
> >>
> >>         Instead of (initially) measuring science application performance, we measure
> >>         hardware performance as a surrogate. We now have (at least) three ways of
> >>         doing that:
> >>
> >>         x86: minimum, most conservative, estimate, no optimisations allowed for.
> >>         Android: allows for optimised hardware pathways with vfp or neon, but
> >>         doesn't relate back to science app capability.
> >>         GPU: maximum theoretical 'peak flops', calculated from card parameters, then
> >>         scaled back by rule of thumb.
> >>
> >>         Maybe we should standardise on just one standard?
> >>
> >>
> >>         ------------------------------------------------------------
> >>              *From:* Richard Haselgrove <[email protected]>
> >>              *To:* Josef W. Segur <[email protected]>; David Anderson
> >>              <[email protected]>
> >>              *Cc:* "[email protected]" <[email protected]>
> >>              *Sent:* Tuesday, 10 June 2014, 9:37
> >>              *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me
> >>              again, but please read)
> >>
> >>  http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
> >>
> >>               >__________________________________
> >>               > From: Josef W. Segur <[email protected]>
> >>               >To: David Anderson <[email protected]>
> >>               >Cc: "[email protected]" <[email protected]>;
> >>               >Eric J Korpela <[email protected]>;
> >>               >Richard Haselgrove <[email protected]>
> >>               >Sent: Tuesday, 10 June 2014, 2:19
> >>               >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me
> >>               >again, but please read)
> >>               >
> >>               >
> >>               >Consider Richard's observation:
> >>               >
> >>               >>>     It appears that the Android Whetstone benchmark used in the BOINC
> >>               >>>     client has separate code paths for ARM, vfp, and NEON processors: a vfp
> >>               >>>     or NEON processor will report that it is significantly faster than a
> >>               >>>     plain-vanilla ARM.
> >>               >
> >>               >If that is so, it distinctly differs from the x86 Whetstone which
> >>               >never uses SIMD, and is truly conservative as you would want for 3).
> >>               >--
> >>               >                               Joe
> >>               >
> >>               >
> >>               >
> >>               >On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
> >>               >
> >>               >> Eric:
> >>               >>
> >>               >> Yes, I suspect that's what's going on.
> >>               >> Currently the logic for estimating job runtime
> >>               >> (estimate_flops() in sched_version.cpp) is
> >>               >> 1) if this (host, app version) has > 10 results, use (host, app
> >>               >> version) statistics
> >>               >> 2) if this app version has > 100 results, use app version statistics
> >>               >> 3) else use a conservative estimate based on p_fpops.
> >>               >>
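The three-step fallback described above can be sketched as follows. This is a hypothetical Python illustration, not the code in sched_version.cpp: the names `VersionStats`, `estimate_flops_sketch` and `conservative_factor` are assumptions made for the example, and only the two thresholds (more than 10 and more than 100 results) come from the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Thresholds quoted in the message; everything else here is illustrative.
HOST_MIN_RESULTS = 10
VERSION_MIN_RESULTS = 100


@dataclass
class VersionStats:
    """Hypothetical container for accumulated result statistics."""
    n_results: int
    mean_flops: float


def estimate_flops_sketch(
    host_stats: Optional[VersionStats],
    version_stats: Optional[VersionStats],
    p_fpops: float,
    conservative_factor: float = 1.0,  # assumed knob, not a BOINC parameter
) -> Tuple[float, str]:
    """Return (projected flops, source of the estimate)."""
    # 1) enough results for this (host, app version): use its own statistics
    if host_stats is not None and host_stats.n_results > HOST_MIN_RESULTS:
        return host_stats.mean_flops, "host_app_version"
    # 2) enough results for the app version across all hosts
    if version_stats is not None and version_stats.n_results > VERSION_MIN_RESULTS:
        return version_stats.mean_flops, "app_version"
    # 3) fall back to a conservative figure derived from the host benchmark
    return p_fpops * conservative_factor, "p_fpops"
```

In this sketch a new host with no history falls through to step 2 as soon as the app version has more than 100 results, which is the case where statistics dominated by fast devices would be applied to a slow one.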
> >>               >> I'm not sure we should be doing 2) at all,
> >>               >> since as you point out the first 100 or 1000 results for an app
> >>               >> version will generally be from the fastest devices
> >>               >> (and even in the steady state,
> >>               >> app version statistics disproportionately reflect fast devices).
> >>               >>
> >>               >> I'll make this change.
> >>               >>
> >>               >> -- David
> >>               >>
> >>               >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
> >>               >>> I also don't have direct access to the server, so I'm mostly guessing.
> >>               >>> Having separate benchmarks for neon and VFP means there's a broad bimodal
> >>               >>> distribution for the benchmark results.  Where the mean falls depends upon
> >>               >>> the mix of machines.  In general the neon machines (being newer and
> >>               >>> faster) will report first and more often, so early on the PFC
> >>               >>> distribution will reflect the fast machines.  Slower machines will be
> >>               >>> underweighted.  So the work will be estimated to complete quickly, and
> >>               >>> some machines will time out.  In SETI beta, it resolves itself
> >>               >>> in a few weeks.  I can't guarantee that it will anywhere else.
> >>               >>>
> >>               >>> We see this with every release of a GPU app.  The real capabilities of
> >>               >>> graphics cards vary by orders of magnitude from the estimate and by more
> >>               >>> from each other.  The fast cards report first and almost everyone else
> >>               >>> hits days of timeouts.
> >>               >>>
> >>               >>> One possible fix is to increase the timeout limits for the first 10
> >>               >>> workunits for a host_app_version, until host based estimates take over.
> >>               >>>
> >>               >>>
> >>               >>>
> >>               >>>
> >>               >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
> >>               >>>
> >>               >>>     I think Eric Korpela would be the best person to answer that question,
> >>               >>>     but I suspect 'probably not': further investigation over the weekend
> >>               >>>     suggests that the circumstances may be SIMAP-specific.
> >>               >>>
> >>               >>>     It appears that the Android Whetstone benchmark used in the BOINC
> >>               >>>     client has separate code paths for ARM, vfp, and NEON processors: a vfp
> >>               >>>     or NEON processor will report that it is significantly faster than a
> >>               >>>     plain-vanilla ARM.
> >>               >>>
> >>               >>>     However, SIMAP have only deployed a single Android app, which I'm
> >>               >>>     assuming only uses ARM functions: devices with vfp or NEON SIMD
> >>               >>>     vectorisation available would run the non-optimised application much
> >>               >>>     slower than BOINC expects.
> >>               >>>
> >>               >>>     At my suggestion, Thomas Rattei (SIMAP administrator) increased the
> >>               >>>     rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the
> >>               >>>     maximum runtime displayed on
> >>               >>>     http://boincsimap.org/boincsimap/server_status.php has
> >>               >>>     already increased from 11 hours to 14 hours since he did that.
> >>               >>>
> >>               >>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot.
> >>               >>>     However, due to Samsung PowerSleep, we thought these are mainly "lazy"
> >>               >>>     users just not using their phone regularly for computing." He's going to
> >>               >>>     monitor how this progresses during the remainder of the current batch,
> >>               >>>     and I've asked him to keep us updated on his observations.
> >>               >>>
> >>               >>>
> >>               >>>
> >>               >>>      >__________________________________
> >>               >>>      > From: David Anderson <[email protected]>
> >>               >>>      >To: [email protected]
> >>               >>>      >Sent: Monday, 9 June 2014, 3:48
> >>               >>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me
> >>               >>>      >again, but please read)
> >>               >>>      >
> >>               >>>      >
> >>               >>>      >Does this problem occur on SETI@home?
> >>               >>>      >-- David
> >>               >>>      >
> >>               >>>      >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
> >>               >>>      >
> >>               >>>      >> 2) Android runtime estimates
> >>               >>>      >>
> >>               >>>      >> The example here is from SIMAP. During a recent pause between
> >>               >>>      >> batches, I noticed that some of my 'pending validation' tasks
> >>               >>>      >> were being slow to clear:
> >>               >>>      >> http://boincsimap.org/boincsimap/results.php?hostid=349248
> >>               >>>      >>
> >>               >>>      >> The clearest example is the third of those three workunits:
> >>               >>>      >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
> >>               >>>      >>
> >>               >>>      >> Four of the seven replications have failed with 'Error while
> >>               >>>      >> computing', and every one of those four is an
> >>               >>>      >> EXIT_TIME_LIMIT_EXCEEDED on an Android device.
> >>               >>>      >>
> >>               >>>      >> Three of the four hosts have never returned a valid result (total
> >>               >>>      >> credit zero), so they have never had a chance to establish an APR
> >>               >>>      >> for use in runtime estimation: runtime estimates and bounds must
> >>               >>>      >> have been generated by the server.
> >>               >>>      >>
> >>               >>>      >> It seems - from these results, and others I've found pending on
> >>               >>>      >> other machines - that SIMAP tasks on Android are aborted with
> >>               >>>      >> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch
> >>               >>>      >> released today, SIMAP are using a 3x bound (which may be a bit low
> >>               >>>      >> under the circumstances):
> >>               >>>      >>
> >>               >>>      >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
> >>               >>>      >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
> >>               >>>      >>
> >>               >>>      >> so I deduce that the tasks when first issued had a runtime estimate
> >>               >>>      >> of ~2 hours.
> >>               >>>      >>
> >>               >>>      >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS),
> >>               >>>      >> take over half an hour to complete: two hours for an ARM device
> >>               >>>      >> sounds suspiciously low. The only one of my Android wingmates to
> >>               >>>      >> have registered an APR
> >>               >>>      >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033)
> >>               >>>      >> is showing 1.69 GFLOPS, but I have no way of knowing whether that
> >>               >>>      >> APR was established before or after the task in question errored
> >>               >>>      >> out.
> >>               >>>      >>
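The arithmetic behind the "~2 hours" deduction can be checked with a short Python sketch. It assumes that the elapsed-time limit is rsc_fpops_bound divided by the projected speed, and that the runtime estimate is rsc_fpops_est divided by the same speed; the function names are made up for the illustration, and the 6 hour abort time is the approximate figure observed above.

```python
# Values taken from the workunit quoted above.
RSC_FPOPS_EST = 13_500_000_000_000.0
RSC_FPOPS_BOUND = 40_500_000_000_000.0

OBSERVED_ABORT_HOURS = 6.0   # approximate, from the failed Android results
HASWELL_APR_FLOPS = 7.34e9   # APR reported for the Intel i5 host


def bound_multiplier(est: float, bound: float) -> float:
    """Ratio of the abort bound to the estimate (the '3x bound')."""
    return bound / est


def implied_projected_flops(bound: float, abort_hours: float) -> float:
    """Speed the server must have assumed if the limit is bound / flops."""
    return bound / (abort_hours * 3600.0)


def estimated_runtime_hours(est: float, flops: float) -> float:
    """Runtime estimate in hours for a given projected speed."""
    return est / flops / 3600.0


if __name__ == "__main__":
    flops = implied_projected_flops(RSC_FPOPS_BOUND, OBSERVED_ABORT_HOURS)
    print("bound multiplier:", bound_multiplier(RSC_FPOPS_EST, RSC_FPOPS_BOUND))
    print("implied projected GFLOPS:", flops / 1e9)
    print("initial estimate (h):", estimated_runtime_hours(RSC_FPOPS_EST, flops))
    print("Haswell runtime (min):",
          estimated_runtime_hours(RSC_FPOPS_EST, HASWELL_APR_FLOPS) * 60.0)
```

Under these assumptions a 6 hour abort with a 3x bound implies a projected speed of about 1.9 GFLOPS and an initial estimate of 2 hours, while the same estimate at 7.34 GFLOPS gives roughly 31 minutes, in line with "over half an hour" on the Haswell host.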
> >>               >>>      >> From experience - borne out by current tests at Albert@Home, where
> >>               >>>      >> server logs are helpfully exposed to the public - initial server
> >>               >>>      >> estimates can be hopelessly over-optimistic. These two are for the
> >>               >>>      >> same machine:
> >>               >>>      >>
> >>               >>>      >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301)
> >>               >>>      >> adjusting projected flops based on PFC avg: 2124.60G
> >>               >>>      >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301)
> >>               >>>      >> setting projected flops based on host elapsed time avg: 23.71G
> >>               >>>      >>
> >>               >>>      >> Since SIMAP have recently announced that they are leaving the BOINC
> >>               >>>      >> platform at the end of the year (despite being an Android launch
> >>               >>>      >> partner with Samsung), I doubt they'll want to put much effort into
> >>               >>>      >> researching this issue.
> >>               >>>      >>
> >>               >>>      >> But if other projects experimenting with Android applications are
> >>               >>>      >> experiencing a high task failure rate, they might like to check
> >>               >>>      >> whether EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those
> >>               >>>      >> failures, and if so, consider the other remediation approaches
> >>               >>>      >> (apart from outliers, which isn't relevant in this case) that I
> >>               >>>      >> suggested to Eric Mcintosh at LHC.
> >>               >
> >>               >
> >>               >
> >>              _________________________________________________
> >>              boinc_dev mailing list
> >>              [email protected]
> >>              http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> >>              To unsubscribe, visit the above URL and
> >>              (near bottom of page) enter your email address.
> 

