Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

Richard Haselgrove Wed, 11 Jun 2014 13:27:07 -0700

That one made it as far as the planning document, but no further:

http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen#Proposal:credit-drivenscheduling



The surrogate, REC, is essentially speed * time, or back to square one.



>________________________________
> From: Eric J Korpela <[email protected]>
>To: David Anderson <[email protected]> 
>Cc: Richard Haselgrove <[email protected]>; Josef W. Segur 
><[email protected]>; "[email protected]" 
><[email protected]> 
>Sent: Wednesday, 11 June 2014, 21:03
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
>please read)
> 
>
>
>Another possibility that came to me years ago would be to use RAC rather than 
>estimated duration to compute the resource allocation on the client side.  
>That way, a machine running two projects with equal resource share would end 
>up spending more time running the one with lower granted credit per unit work. 
>That would encourage projects not to over-grant (they would lose resources) 
>or under-grant (they would lose volunteers).
>
>
>
>
>On Tue, Jun 10, 2014 at 1:03 PM, Eric J Korpela <[email protected]> 
>wrote:
>
>
>>I haven't thought about it in a while.  I had come up with a stable system 
>>that would work, but it wasn't simple and it also required projects to 
>>voluntarily participate.  Therefore it wouldn't have worked in practice.
>>
>>The only thought I've had recently is to have a "calibration" plan class that 
>>has a non-SIMD non-threaded unoptimized CPU-only app_version that gets sent 
>>out once out of every N (~100,000) results. This (as the least efficient 
>>app_version) could set the pfc_scale.  Again, it would require project 
>>participation, so it wouldn't work.
>>
>>So I spend most of my time trying not to think about it.
>>
>>
>>
>>
>>
>>On Tue, Jun 10, 2014 at 12:12 PM, David Anderson <[email protected]> 
>>wrote:
>>
>>>Are you saying we're taking the wrong approach?
>>>Any other suggestions?
>>>
>>>
>>>On 10-Jun-2014 11:51 AM, Eric J Korpela wrote:
>>>
>>>> >For credit purposes, the standard is peak FLOPS,
>>>> >i.e. we give credit for what the device could do,
>>>> >rather than what it actually did.
>>>> >Among other things, this encourages projects to develop more efficient apps.
>>>>
>>>>It does the opposite, because many projects care more about attracting
>>>>volunteers than they do about efficient computation.
>>>>
>>>>First: Per second of run time, a host gets the same credit for a
>>>>non-optimized stock app as it does for an optimized stock app.  There's no
>>>>benefit to the volunteer to go to a project with optimized apps.  In fact
>>>>there's a benefit for users to compile an optimized app for use at a
>>>>non-optimized project, where their credit will be higher.  Every time we
>>>>optimize SETI@home we get bombarded by users of non-stock optimized apps
>>>>who get angry because their RAC goes down.  That makes it a disincentive
>>>>to optimize.
>>>>
>>>>Second: This method encourages projects to create separate apps for GPUs
>>>>rather than separate app_versions.  Because GPUs obtain nowhere near their
>>>>advertised rates for real code, a separate GPU app can earn 20 to 100x the
>>>>credit of a GPU app_version of an app that also has CPU app_versions.
>>>>
>>>>Third: It encourages projects to not use the BOINC credit granting
>>>>mechanisms.  To compete with projects that have GPU-only apps, some
>>>>projects grant outrageous credit for everything.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
>>>>
>>>>    For credit purposes, the standard is peak FLOPS,
>>>>    i.e. we give credit for what the device could do,
>>>>    rather than what it actually did.
>>>>    Among other things, this encourages projects to develop more efficient apps.
>>>>
>>>>    Currently we're not measuring this well for x86 CPUs,
>>>>    since our Whetstone benchmark isn't optimized.
>>>>    Ideally the BOINC client should include variants for the most common
>>>>    CPU features, as we do for ARM.
>>>>
>>>>    -- D
>>>>
>>>>
>>>>    On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
>>>>
>>>>        Before anybody leaps into making any changes on the basis of that
>>>>        observation, I think we ought to pause and consider why we have a
>>>>        benchmark, and what we use it for.
>>>>
>>>>        I'd suggest that in an ideal world, we would be measuring the actual
>>>>        running speed of (each project's) science applications on that
>>>>        particular host, optimisations and all. We gradually do this through
>>>>        the runtime averages anyway, but it's hard to gather a priori data
>>>>        on a new host.
>>>>
>>>>        Instead of (initially) measuring science application performance, we
>>>>        measure hardware performance as a surrogate. We now have (at least)
>>>>        three ways of doing that:
>>>>
>>>>        x86: minimum, most conservative, estimate, no optimisations allowed for.
>>>>        Android: allows for optimised hardware pathways with vfp or neon, but
>>>>        doesn't relate back to science app capability.
>>>>        GPU: maximum theoretical 'peak flops', calculated from card
>>>>        parameters, then scaled back by rule of thumb.
>>>>
>>>>        Maybe we should standardise on just one standard?
>>>>
>>>>
>>>>
        
>>>>        ------------------------------------------------------------------
>>>>             *From:* Richard Haselgrove <[email protected]>
>>>>             *To:* Josef W. Segur <[email protected]>; David Anderson <[email protected]>
>>>>             *Cc:* "[email protected]" <[email protected]>
>>>>             *Sent:* Tuesday, 10 June 2014, 9:37
>>>>             *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>>>
>>>>
        
>>>>        http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>>>>
>>>>              >__________________________________
>>>>              > From: Josef W. Segur <[email protected]>
>>>>              >To: David Anderson <[email protected]>
>>>>              >Cc: "[email protected]" <[email protected]>; Eric J Korpela <[email protected]>; Richard Haselgrove <[email protected]>
>>>>              >Sent: Tuesday, 10 June 2014, 2:19
>>>>              >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>>>              >
>>>>              >
>>>>              >Consider Richard's observation:
>>>>              >
>>>>              >>>     It appears that the Android Whetstone benchmark used in the BOINC client has
>>>>              >>>     separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>>>>              >>>     will report that it is significantly faster than a plain-vanilla ARM.
>>>>              >
>>>>              >If that is so, it distinctly differs from the x86 Whetstone which never uses
>>>>              >SIMD, and is truly conservative as you would want for 3).
>>>>              >--
>>>>              >                               Joe
>>>>              >
>>>>              >
>>>>              >
>>>>              >On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
>>>>              >
>>>>              >> Eric:
>>>>              >>
>>>>              >> Yes, I suspect that's what's going on.
>>>>              >> Currently the logic for estimating job runtime
>>>>              >> (estimate_flops() in sched_version.cpp) is
>>>>              >> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
>>>>              >> 2) if this app version has > 100 results, use app version statistics
>>>>              >> 3) else use a conservative estimate based on p_fpops.
>>>>              >>
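The three-step fallback described above can be sketched as follows. This is only an illustration in Python, not the actual estimate_flops() in sched_version.cpp: the Stats type, the function name, and the argument names are assumptions made for the example, and only the thresholds (more than 10 and more than 100 results) come from the message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stats:
    """Hypothetical container for accumulated performance statistics."""
    n_results: int    # number of completed results behind the average
    flops_avg: float  # average measured speed, in FLOPS


def estimate_flops_sketch(host_app_version: Optional[Stats],
                          app_version: Optional[Stats],
                          p_fpops: float) -> float:
    """Pick the speed used to project a job's runtime (illustrative only)."""
    # 1) Enough history for this (host, app version) pair: trust it.
    if host_app_version is not None and host_app_version.n_results > 10:
        return host_app_version.flops_avg
    # 2) Otherwise fall back to statistics across all hosts for the app version.
    if app_version is not None and app_version.n_results > 100:
        return app_version.flops_avg
    # 3) Otherwise use the host's benchmark figure as a conservative estimate.
    return p_fpops


# A new host (no history) on a well-established app version gets step 2,
# which is the case the thread argues is skewed towards fast devices.
new_host = estimate_flops_sketch(None, Stats(500, 9e9), 1e9)
```

Dropping step 2, as proposed in the message, would amount to deleting the second `if` so that new hosts fall straight through to the benchmark-based estimate.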
>>>>              >> I'm not sure we should be doing 2) at all,
>>>>              >> since as you point out the first 100 or 1000 results for an app version
>>>>              >> will generally be from the fastest devices
>>>>              >> (and even in the steady state,
>>>>              >> app version statistics disproportionately reflect fast devices).
>>>>              >>
>>>>              >> I'll make this change.
>>>>              >>
>>>>              >> -- David
>>>>              >>
>>>>              >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>>>>              >>> I don't have direct access to the server either, so I'm mostly guessing.
>>>>              >>> Having separate benchmarks for neon and VFP means there's a broad bimodal
>>>>              >>> distribution for the benchmark results.  Where the mean falls depends upon the mix
>>>>              >>> of machines.  In general the neon machines (being newer and faster) will report
>>>>              >>> first and more often, so early on the PFC distribution will reflect the fast
>>>>              >>> machines.  Slower machines will be underweighted.  So the work will be estimated to
>>>>              >>> complete quickly, and some machines will time out.  In SETI beta, it resolves itself
>>>>              >>> in a few weeks.  I can't guarantee that it will anywhere else.
>>>>              >>>
>>>>              >>> We see this with every release of a GPU app.  The real capabilities of graphics
>>>>              >>> cards vary by orders of magnitude from the estimate and by more from each other.
>>>>              >>> The fast cards report first and most everyone else hits days of timeouts.
>>>>              >>>
>>>>              >>> One possible fix is to increase the timeout limits for the first 10 workunits for a
>>>>              >>> host_app_version, until host based estimates take over.
>>>>              >>>
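One way to read the fix suggested above is as a grace period on the runtime limit. The sketch below is hypothetical: the function name, the grace_multiplier value, and the assumption that the limit is rsc_fpops_bound divided by the projected speed are illustrative choices, not existing BOINC code. Only the "first 10 workunits for a host_app_version" threshold comes from the message.

```python
def runtime_limit_seconds(rsc_fpops_bound: float,
                          projected_flops: float,
                          host_av_results: int,
                          grace_results: int = 10,
                          grace_multiplier: float = 10.0) -> float:
    """Elapsed-time limit for a task, widened while the host has no history.

    Assumes the normal limit is rsc_fpops_bound / projected_flops.
    grace_multiplier is a made-up value for illustration.
    """
    limit = rsc_fpops_bound / projected_flops
    # Until the host has returned enough results for host-based estimates
    # to take over, allow extra time instead of aborting the task.
    if host_av_results < grace_results:
        limit *= grace_multiplier
    return limit


# Same task and projected speed, before and after the host has history.
first_task = runtime_limit_seconds(4.05e13, 1.875e9, host_av_results=0)
later_task = runtime_limit_seconds(4.05e13, 1.875e9, host_av_results=10)
```

The trade-off is that a genuinely stuck task on a new host would also run longer before being aborted.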
>>>>              >>>
>>>>              >>>
>>>>              >>>
>>>>              >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
>>>>              >>>
>>>>              >>>     I think Eric Korpela would be the best person to answer that question, but I
>>>>              >>>     suspect 'probably not': further investigation over the weekend suggests that the
>>>>              >>>     circumstances may be SIMAP-specific.
>>>>              >>>
>>>>              >>>     It appears that the Android Whetstone benchmark used in the BOINC client has
>>>>              >>>     separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>>>>              >>>     will report that it is significantly faster than a plain-vanilla ARM.
>>>>              >>>
>>>>              >>>     However, SIMAP have only deployed a single Android app, which I'm assuming only
>>>>              >>>     uses ARM functions: devices with vfp or NEON SIMD vectorisation available would
>>>>              >>>     run the non-optimised application much slower than BOINC expects.
>>>>              >>>
>>>>              >>>     At my suggestion, Thomas Rattei (SIMAP administrator) increased the
>>>>              >>>     rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
>>>>              >>>     runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
>>>>              >>>     already increased from 11 hours to 14 hours since he did that.
>>>>              >>>
>>>>              >>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
>>>>              >>>     due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
>>>>              >>>     using their phone regularly for computing." He's going to monitor how this
>>>>              >>>     progresses during the remainder of the current batch, and I've asked him to keep
>>>>              >>>     us updated on his observations.
>>>>              >>>
>>>>              >>>
>>>>              >>>
>>>>
>>>>              >>>      >__________________________________
>>>>              >>>      > From: David Anderson <[email protected]>
>>>>              >>>      >To: [email protected]
>>>>              >>>      >Sent: Monday, 9 June 2014, 3:48
>>>>              >>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>>>              >>>      >
>>>>              >>>      >
>>>>              >>>      >Does this problem occur on SETI@home?
>>>>              >>>      >-- David
>>>>              >>>      >
>>>>              >>>      >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>>>>              >>>      >
>>>>              >>>      >> 2) Android runtime estimates
>>>>              >>>      >>
>>>>              >>>      >> The example here is from SIMAP. During a recent pause between batches, I noticed
>>>>              >>>      >> that some of my 'pending validation' tasks were being slow to clear:
>>>>              >>>      >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>>>>              >>>      >>
>>>>              >>>      >> The clearest example is the third of those three workunits:
>>>>              >>>      >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>>>              >>>      >>
>>>>              >>>      >> Four of the seven replications have failed with 'Error while computing', and
>>>>              >>>      >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>>>>              >>>      >>
>>>>              >>>      >> Three of the four hosts have never returned a valid result (total credit zero),
>>>>              >>>      >> so they have never had a chance to establish an APR for use in runtime
>>>>              >>>      >> estimation: runtime estimates and bounds must have been generated by the server.
>>>>              >>>      >>
>>>>              >>>      >> It seems - from these results, and others I've found pending on other machines -
>>>>              >>>      >> that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after ~6
>>>>              >>>      >> hours elapsed. For the new batch released today, SIMAP are using a 3x bound
>>>>              >>>      >> (which may be a bit low under the circumstances):
>>>>              >>>      >>
>>>>              >>>      >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>>>              >>>      >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>>>              >>>      >>
>>>>              >>>      >> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
>>>>              >>>      >>
>>>>              >>>      >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over half
>>>>              >>>      >> an hour to complete: two hours for an ARM device sounds suspiciously low. The
>>>>              >>>      >> only one of my Android wingmates to have registered an APR
>>>>              >>>      >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is showing
>>>>              >>>      >> 1.69 GFLOPS, but I have no way of knowing whether that APR was established before
>>>>              >>>      >> or after the task in question errored out.
>>>>              >>>      >>
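The arithmetic behind these figures can be checked with a short calculation. It assumes the usual relationship, runtime = floating-point operations / speed in FLOPS, and it assumes the earlier batch carried the same rsc_fpops_est as the new one (the message does not say so); the variable names are illustrative.

```python
# Values quoted in the message above.
RSC_FPOPS_EST = 13_500_000_000_000.0
RSC_FPOPS_BOUND = 40_500_000_000_000.0
HASWELL_APR_FLOPS = 7.34e9   # "APR 7.34 GFLOPS"
ANDROID_APR_FLOPS = 1.69e9   # wingmate's registered APR


def runtime_seconds(fpops: float, flops: float) -> float:
    """Projected runtime, assuming runtime = operations / speed."""
    return fpops / flops


# The bound is three times the estimate, matching the "3x bound" remark.
bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST

# Speed the server would have to assume to arrive at a ~2 hour estimate.
implied_flops = RSC_FPOPS_EST / (2 * 3600)

# At the Haswell's APR the estimate is about half an hour, which is
# consistent with "take over half an hour to complete".
haswell_minutes = runtime_seconds(RSC_FPOPS_EST, HASWELL_APR_FLOPS) / 60

# At the Android wingmate's APR the same task is estimated above 2 hours.
android_hours = runtime_seconds(RSC_FPOPS_EST, ANDROID_APR_FLOPS) / 3600
```

Under those assumptions a ~2 hour estimate corresponds to a projected speed of about 1.9 GFLOPS, slightly faster than the one Android APR actually observed.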
>>>>              >>>      >> From experience - borne out by current tests at Albert@Home, where server logs
>>>>              >>>      >> are helpfully exposed to the public - initial server estimates can be hopelessly
>>>>              >>>      >> over-optimistic. These two are for the same machine:
>>>>              >>>      >>
>>>>              >>>      >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>>>>              >>>      >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
>>>>              >>>      >>
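To put a number on "hopelessly over-optimistic", the two projected speeds in those log lines can be compared directly. The sketch below parses the figures from the quoted lines; the helper names are hypothetical, and the abort condition assumes that the time limit is the runtime estimate multiplied by the project's bound multiplier.

```python
import re

# The two log lines quoted above, for the same machine and app version.
LOG_LINES = [
    "2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) "
    "adjusting projected flops based on PFC avg: 2124.60G",
    "2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) "
    "setting projected flops based on host elapsed time avg: 23.71G",
]


def projected_gflops(line: str) -> float:
    """Extract the trailing 'avg: <number>G' figure from a log line."""
    match = re.search(r"avg: ([0-9.]+)G", line)
    if match is None:
        raise ValueError("no projected flops figure found")
    return float(match.group(1))


initial, measured = (projected_gflops(line) for line in LOG_LINES)

# How many times faster the server first assumed the host to be.
overestimate = initial / measured


def would_time_out(overestimate_factor: float, bound_multiplier: float) -> bool:
    """A task runs overestimate_factor times longer than estimated, so it is
    aborted if that factor exceeds the bound multiplier (assumed model)."""
    return overestimate_factor > bound_multiplier
```

The initial projection is roughly 90 times the speed later measured on the host, so under this model neither a 3x nor a 10x bound would have been enough to let the first tasks finish.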
>>>>              >>>      >> Since SIMAP have recently announced that they are leaving the BOINC platform at
>>>>              >>>      >> the end of the year (despite being an Android launch partner with Samsung), I
>>>>              >>>      >> doubt they'll want to put much effort into researching this issue.
>>>>              >>>      >>
>>>>              >>>      >> But if other projects experimenting with Android applications are experiencing a
>>>>              >>>      >> high task failure rate, they might like to check whether EXIT_TIME_LIMIT_EXCEEDED
>>>>              >>>      >> is a significant factor in those failures, and if so, consider the other
>>>>              >>>      >> remediation approaches (apart from outliers, which isn't relevant in this case)
>>>>              >>>      >> that I suggested to Eric Mcintosh at LHC.
>>>>              >
>>>>              >
>>>>              >
>>>>
>>>>             _________________________________________________
>>>>             boinc_dev mailing list
>>>>             [email protected]
>>>>             http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>>>             To unsubscribe, visit the above URL and
>>>>             (near bottom of page) enter your email address.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
