Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)

McLeod, John Thu, 12 Jun 2014 06:24:08 -0700

RAC was not used because of the delay in granting credit.  However, if we
remember that RAC is a long-term average, it makes more sense.  If we use RAC
as the estimate, the client should also contact each of the servers
occasionally for an updated RAC, but not very often: once a week or so would
probably suffice.  If more than a week has passed since the last connection,
it would be time to try again.
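
[Editorial sketch, not part of the original message.] The weekly refresh rule described above could look roughly like the following. The function names, the one-week constant, and the dictionary of last-contact times are all hypothetical; nothing here is taken from the BOINC client.

```python
from datetime import datetime, timedelta

# Hypothetical interval: the message only suggests "once a week or so".
RAC_REFRESH_INTERVAL = timedelta(weeks=1)


def rac_refresh_due(last_contact, now, interval=RAC_REFRESH_INTERVAL):
    """Return True when a project's cached RAC is older than the interval.

    last_contact is None when the project has never been contacted.
    """
    if last_contact is None:
        return True
    return now - last_contact > interval


def projects_needing_refresh(last_contacts, now):
    """Given {project_name: last_contact}, list the projects with stale RAC."""
    return sorted(name for name, seen in last_contacts.items()
                  if rac_refresh_due(seen, now))
```

A client would call `projects_needing_refresh` at some convenient point and contact only the projects it returns.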


-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of 
Richard Haselgrove
Sent: Wednesday, June 11, 2014 4:23 PM
To: Eric J Korpela; David Anderson
Cc: [email protected]; Josef W. Segur
Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
please read)

That one made it as far as the planning document, but no further:

http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen#Proposal:credit-drivenscheduling


The surrogate, REC, is essentially speed * time, or back to square one.



>________________________________
> From: Eric J Korpela <[email protected]>
>To: David Anderson <[email protected]> 
>Cc: Richard Haselgrove <[email protected]>; Josef W. Segur 
><[email protected]>; "[email protected]" 
><[email protected]> 
>Sent: Wednesday, 11 June 2014, 21:03
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but 
>please read)
> 
>
>
>Another possibility that came to me years ago would be to use RAC rather than 
>estimated duration to compute the resource allocation on the client side.  
>That way a machine running two projects with equal resource share would end 
>up spending more time running the one with lower granted credit per unit work. 
>That would encourage projects not to over-grant (they would lose resources) 
>or under-grant (they would lose volunteers).
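
[Editorial sketch, not part of the original message.] A toy simulation of the credit-driven allocation idea above. As a simplifying assumption it balances total accumulated credit per unit of resource share rather than a true decaying RAC, and it schedules in whole hours; the function and its parameters are hypothetical.

```python
def simulate_credit_driven_allocation(credit_per_hour, shares, hours):
    """Each hour, run the project that is furthest behind on credit per share.

    credit_per_hour: {project: credit granted per hour of work}
    shares:          {project: resource share}
    Returns {project: hours run}.
    """
    credit = {p: 0.0 for p in credit_per_hour}
    hours_run = {p: 0 for p in credit_per_hour}
    for _ in range(hours):
        # Ties are broken by project name so the result is deterministic.
        nxt = min(credit, key=lambda p: (credit[p] / shares[p], p))
        credit[nxt] += credit_per_hour[nxt]
        hours_run[nxt] += 1
    return hours_run


if __name__ == "__main__":
    # Project B grants twice the credit per hour that A does, equal shares.
    print(simulate_credit_driven_allocation({"A": 10.0, "B": 20.0},
                                            {"A": 1.0, "B": 1.0}, 300))
```

Under these assumptions the lower-paying project ends up with about twice the run time, which is the effect the message describes.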
>
>
>
>
>On Tue, Jun 10, 2014 at 1:03 PM, Eric J Korpela <[email protected]> 
>wrote:
>
>
>>I haven't thought about it in a while.   I had come up with a stable system 
>>that would but it wasn't simple and it also required projects to voluntarily 
>>participate.  Therefore it wouldn't have worked.
>>
>>The only thought I've had recently is to have a "calibration" plan class that 
>>has a non-SIMD non-threaded unoptimized CPU-only app_version that gets sent 
>>out once out of every N (~100,000) results. This (as the least efficient 
>>app_version) could set the pfc_scale.  Again, it would require project 
>>participation, so it wouldn't work.
>>
>>So I spend most of my time trying not to think about it.
>>
>>
>>
>>
>>
>>On Tue, Jun 10, 2014 at 12:12 PM, David Anderson <[email protected]> 
>>wrote:
>>
>>>Are you saying we're taking the wrong approach?
>>>Any other suggestions?
>>>
>>>
>>>On 10-Jun-2014 11:51 AM, Eric J Korpela wrote:
>>>
>>>> >For credit purposes, the standard is peak FLOPS,
>>>> >i.e. we give credit for what the device could do,
>>>> >rather than what it actually did.
>>>> >Among other things, this encourages projects to develop more efficient apps.
>>>>
>>>>It does the opposite, because many projects care more about attracting
>>>>volunteers than they do about efficient computation.
>>>>
>>>>First: Per second of run time, a host gets the same credit for a
>>>>non-optimized stock app as it does for an optimized stock app.  There's no
>>>>benefit to the volunteer to go to a project with optimized apps.  In fact
>>>>there's a benefit for users to compile an optimized app for use at a
>>>>non-optimized project, where their credit will be higher.  Every time we
>>>>optimize SETI@home we get bombarded by users of non-stock optimized apps who
>>>>get angry because their RAC goes down.  That makes it a disincentive to
>>>>optimize.
>>>>
>>>>Second: This method encourages projects to create separate apps for GPUs
>>>>rather than separate app_versions.  Because GPUs obtain nowhere near their
>>>>advertised rates for real code, a separate GPU app can earn 20 to 100x the
>>>>credit of a GPU app_version of an app that also has CPU app_versions.
>>>>
>>>>Third: It encourages projects to not use the BOINC credit granting
>>>>mechanisms.  To compete with projects that have GPU-only apps, some projects
>>>>grant outrageous credit for everything.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
>>>>
>>>>    For credit purposes, the standard is peak FLOPS,
>>>>    i.e. we give credit for what the device could do,
>>>>    rather than what it actually did.
>>>>    Among other things, this encourages projects to develop more efficient apps.
>>>>
>>>>    Currently we're not measuring this well for x86 CPUs,
>>>>    since our Whetstone benchmark isn't optimized.
>>>>    Ideally the BOINC client should include variants for the most common
>>>>    CPU features, as we do for ARM.
>>>>
>>>>    -- D
>>>>
>>>>
>>>>    On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
>>>>
>>>>        Before anybody leaps into making any changes on the basis of that
>>>>        observation, I think we ought to pause and consider why we have a
>>>>        benchmark, and what we use it for.
>>>>
>>>>        I'd suggest that in an ideal world, we would be measuring the actual
>>>>        running speed of (each project's) science applications on that
>>>>        particular host, optimisations and all. We gradually do this through
>>>>        the runtime averages anyway, but it's hard to gather a priori data on
>>>>        a new host.
>>>>
>>>>        Instead of (initially) measuring science application performance, we
>>>>        measure hardware performance as a surrogate. We now have (at least)
>>>>        three ways of doing that:
>>>>
>>>>        x86: minimum, most conservative, estimate, no optimisations allowed for.
>>>>        Android: allows for optimised hardware pathways with vfp or neon, but
>>>>        doesn't relate back to science app capability.
>>>>        GPU: maximum theoretical 'peak flops', calculated from card
>>>>        parameters, then scaled back by rule of thumb.
>>>>
>>>>        Maybe we should standardise on just one standard?
>>>>
>>>>
>>>>
        
>>>>        ------------------------------------------------------------------------
>>>>             *From:* Richard Haselgrove <[email protected]>
>>>>             *To:* Josef W. Segur <[email protected]>; David Anderson
>>>>             <[email protected]>
>>>>             *Cc:* "[email protected]" <[email protected]>
>>>>             *Sent:* Tuesday, 10 June 2014, 9:37
>>>>             *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me
>>>>             again, but please read)
>>>>
>>>>        http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>>>>
>>>>              >__________________________________
>>>>              > From: Josef W. Segur <[email protected]>
>>>>              >To: David Anderson <[email protected]>
>>>>              >Cc: "[email protected]" <[email protected]>;
>>>>              >Eric J Korpela <[email protected]>;
>>>>              >Richard Haselgrove <[email protected]>
>>>>              >Sent: Tuesday, 10 June 2014, 2:19
>>>>              >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me
>>>>              >again, but please read)
>>>>              >
>>>>              >
>>>>              >Consider Richard's observation:
>>>>              >
>>>>              >>>     It appears that the Android Whetstone benchmark used in the
>>>>              >>>     BOINC client has separate code paths for ARM, vfp, and NEON
>>>>              >>>     processors: a vfp or NEON processor will report that it is
>>>>              >>>     significantly faster than a plain-vanilla ARM.
>>>>              >
>>>>              >If that is so, it distinctly differs from the x86 Whetstone, which
>>>>              >never uses SIMD and is truly conservative, as you would want for 3).
>>>>              >--
>>>>              >                               Joe
>>>>              >
>>>>              >
>>>>              >
>>>>              >On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson
>>>>              ><[email protected]> wrote:
>>>>              >
>>>>              >> Eric:
>>>>              >>
>>>>              >> Yes, I suspect that's what's going on.
>>>>              >> Currently the logic for estimating job runtime
>>>>              >> (estimate_flops() in sched_version.cpp) is
>>>>              >> 1) if this (host, app version) has > 10 results, use (host, app
>>>>              >>    version) statistics
>>>>              >> 2) if this app version has > 100 results, use app version statistics
>>>>              >> 3) else use a conservative estimate based on p_fpops.
>>>>              >>
>>>>              >> I'm not sure we should be doing 2) at all,
>>>>              >> since as you point out the first 100 or 1000 results for an app
>>>>              >> version will generally be from the fastest devices
>>>>              >> (and even in the steady state,
>>>>              >> app version statistics disproportionately reflect fast devices).
>>>>              >>
>>>>              >> I'll make this change.
>>>>              >>
>>>>              >> -- David
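
[Editorial sketch, not part of the original message.] The three-tier fallback described above can be written out as follows. Only the thresholds (more than 10 and more than 100 results) and the order of the tiers come from the message. The `Stats` type, the function name, and the use of raw `p_fpops` as the "conservative estimate" are assumptions for illustration; this is not the code in sched_version.cpp.

```python
from dataclasses import dataclass

# Thresholds quoted in the message.
MIN_HOST_APP_VERSION_RESULTS = 10
MIN_APP_VERSION_RESULTS = 100


@dataclass
class Stats:
    n_results: int    # number of results the average is based on
    avg_flops: float  # average observed speed, in FLOPS


def choose_flops_estimate(host_app_version, app_version, p_fpops):
    """Return (projected_flops, source) following the three tiers in order."""
    # 1) Enough history for this exact (host, app version) pair.
    if host_app_version.n_results > MIN_HOST_APP_VERSION_RESULTS:
        return host_app_version.avg_flops, "host_app_version"
    # 2) Enough history for the app version across all hosts.
    if app_version.n_results > MIN_APP_VERSION_RESULTS:
        return app_version.avg_flops, "app_version"
    # 3) Fall back to the host benchmark. How BOINC scales p_fpops to make
    #    it conservative is not stated in the thread, so it is used as-is.
    return p_fpops, "benchmark"
```

Dropping tier 2, as the message proposes, would amount to deleting the second `if` so that new hosts go straight from the benchmark to their own statistics.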
>>>>              >>
>>>>              >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>>>>              >>> I also don't have direct access to the server, so I'm mostly
>>>>              >>> guessing. Having separate benchmarks for neon and VFP means
>>>>              >>> there's a broad bimodal distribution for the benchmark results.
>>>>              >>> Where the mean falls depends upon the mix of machines.  In
>>>>              >>> general the neon machines (being newer and faster) will report
>>>>              >>> first and more often, so early on the PFC distribution will
>>>>              >>> reflect the fast machines.  Slower machines will be
>>>>              >>> underweighted.  So the work will be estimated to complete
>>>>              >>> quickly, and some machines will time out.  In SETI beta, it
>>>>              >>> resolves itself in a few weeks.  I can't guarantee that it will
>>>>              >>> anywhere else.
>>>>              >>>
>>>>              >>> We see this with every release of a GPU app.  The real
>>>>              >>> capabilities of graphics cards vary by orders of magnitude from
>>>>              >>> the estimate and by more from each other.  The fast cards report
>>>>              >>> first and most everyone else hits days of timeouts.
>>>>              >>>
>>>>              >>> One possible fix is to increase the timeout limits for the first
>>>>              >>> 10 workunits for a host_app_version, until host-based estimates
>>>>              >>> take over.
>>>>              >>>
>>>>              >>>
>>>>              >>>
>>>>              >>>
>>>>              >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove
>>>>              >>> <[email protected]> wrote:
>>>>              >>>
>>>>              >>>     I think Eric Korpela would be the best person to answer that
>>>>              >>>     question, but I suspect 'probably not': further investigation
>>>>              >>>     over the weekend suggests that the circumstances may be
>>>>              >>>     SIMAP-specific.
>>>>              >>>
>>>>              >>>     It appears that the Android Whetstone benchmark used in the
>>>>              >>>     BOINC client has separate code paths for ARM, vfp, and NEON
>>>>              >>>     processors: a vfp or NEON processor will report that it is
>>>>              >>>     significantly faster than a plain-vanilla ARM.
>>>>              >>>
>>>>              >>>     However, SIMAP have only deployed a single Android app, which
>>>>              >>>     I'm assuming only uses ARM functions: devices with vfp or NEON
>>>>              >>>     SIMD vectorisation available would run the non-optimised
>>>>              >>>     application much slower than BOINC expects.
>>>>              >>>
>>>>              >>>     At my suggestion, Thomas Rattei (SIMAP administrator)
>>>>              >>>     increased the rsc_fpops_bound multiplier to 10x on Sunday
>>>>              >>>     afternoon. I note that the maximum runtime displayed on
>>>>              >>>     http://boincsimap.org/boincsimap/server_status.php has
>>>>              >>>     already increased from 11 hours to 14 hours since he did that.
>>>>              >>>
>>>>              >>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED]
>>>>              >>>     a lot. However, due to Samsung PowerSleep, we thought these
>>>>              >>>     are mainly "lazy" users just not using their phone regularly
>>>>              >>>     for computing." He's going to monitor how this progresses
>>>>              >>>     during the remainder of the current batch, and I've asked him
>>>>              >>>     to keep us updated on his observations.
>>>>              >>>
>>>>              >>>
>>>>              >>>
>>>>
>>>>              >>>      >__________________________________
>>>>              >>>      > From: David Anderson <[email protected]>
>>>>              >>>      >To: [email protected]
>>>>              >>>      >Sent: Monday, 9 June 2014, 3:48
>>>>              >>>      >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes
>>>>              >>>      >me again, but please read)
>>>>              >>>      >
>>>>              >>>      >
>>>>              >>>      >Does this problem occur on SETI@home?
>>>>              >>>      >-- David
>>>>              >>>      >
>>>>              >>>      >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>>>>              >>>      >
>>>>              >>>      >> 2) Android runtime estimates
>>>>              >>>      >>
>>>>              >>>      >> The example here is from SIMAP. During a recent pause between
>>>>              >>>      >> batches, I noticed that some of my 'pending validation' tasks
>>>>              >>>      >> were being slow to clear:
>>>>              >>>      >>
>>>>              >>>      >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>>>>              >>>      >>
>>>>              >>>      >> The clearest example is the third of those three workunits:
>>>>              >>>      >>
>>>>              >>>      >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>>>              >>>      >>
>>>>              >>>      >> Four of the seven replications have failed with 'Error while
>>>>              >>>      >> computing', and every one of those four is an
>>>>              >>>      >> EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>>>>              >>>      >>
>>>>              >>>      >> Three of the four hosts have never returned a valid result
>>>>              >>>      >> (total credit zero), so they have never had a chance to
>>>>              >>>      >> establish an APR for use in runtime estimation: runtime
>>>>              >>>      >> estimates and bounds must have been generated by the server.
>>>>              >>>      >>
>>>>              >>>      >> It seems - from these results, and others I've found pending on
>>>>              >>>      >> other machines - that SIMAP tasks on Android are aborted with
>>>>              >>>      >> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new
>>>>              >>>      >> batch released today, SIMAP are using a 3x bound (which may be
>>>>              >>>      >> a bit low under the circumstances):
>>>>              >>>      >>
>>>>
>>>>              >>>      >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>>>              >>>      >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>>>              >>>      >>
>>>>              >>>      >> so I deduce that the tasks when first issued had a runtime
>>>>              >>>      >> estimate of ~2 hours.
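
[Editorial sketch, not part of the original message.] The deduction above can be checked with a few lines of arithmetic. The model assumed here is the one the message implies: the runtime estimate is rsc_fpops_est divided by the projected speed, and the abort limit is rsc_fpops_bound divided by the same speed. The helper names are hypothetical.

```python
# Values quoted in the message.
RSC_FPOPS_EST = 13_500_000_000_000.0
RSC_FPOPS_BOUND = 40_500_000_000_000.0
OBSERVED_LIMIT_HOURS = 6.0  # tasks aborted after ~6 hours elapsed


def bound_multiplier(est, bound):
    """Ratio of the abort bound to the estimate (the '3x bound')."""
    return bound / est


def implied_estimate_hours(limit_hours, est, bound):
    """Runtime estimate implied by an observed abort limit."""
    return limit_hours * est / bound


def implied_projected_flops(est, estimate_hours):
    """Projected speed the server must have assumed, in FLOPS."""
    return est / (estimate_hours * 3600.0)


if __name__ == "__main__":
    hours = implied_estimate_hours(OBSERVED_LIMIT_HOURS,
                                   RSC_FPOPS_EST, RSC_FPOPS_BOUND)
    print(bound_multiplier(RSC_FPOPS_EST, RSC_FPOPS_BOUND))       # 3x bound
    print(hours)                                                   # ~2 hours
    print(implied_projected_flops(RSC_FPOPS_EST, hours) / 1e9)     # GFLOPS
```

Under that model the server was projecting roughly 1.9 GFLOPS for these Android hosts.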
>>>>              >>>      >>
>>>>              >>>      >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS),
>>>>              >>>      >> take over half an hour to complete: two hours for an ARM device
>>>>              >>>      >> sounds suspiciously low. The only one of my Android wingmates to
>>>>              >>>      >> have registered an APR
>>>>              >>>      >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033)
>>>>              >>>      >> is showing 1.69 GFLOPS, but I have no way of knowing whether that
>>>>              >>>      >> APR was established before or after the task in question errored
>>>>              >>>      >> out.
>>>>              >>>      >>
>>>>              >>>      >> From experience - borne out by current tests at Albert@Home,
>>>>              >>>      >> where server logs are helpfully exposed to the public - initial
>>>>              >>>      >> server estimates can be hopelessly over-optimistic. These two are
>>>>              >>>      >> for the same machine:
>>>>              >>>      >>
>>>>              >>>      >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>>>>              >>>      >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
>>>>              >>>      >>
>>>>              >>>      >> Since SIMAP have recently announced that they are leaving the
>>>>              >>>      >> BOINC platform at the end of the year (despite being an Android
>>>>              >>>      >> launch partner with Samsung), I doubt they'll want to put much
>>>>              >>>      >> effort into researching this issue.
>>>>              >>>      >>
>>>>              >>>      >> But if other projects experimenting with Android applications
>>>>              >>>      >> are experiencing a high task failure rate, they might like to
>>>>              >>>      >> check whether EXIT_TIME_LIMIT_EXCEEDED is a significant factor
>>>>              >>>      >> in those failures, and if so, consider the other remediation
>>>>              >>>      >> approaches (apart from outliers, which isn't relevant in this
>>>>              >>>      >> case) that I suggested to Eric Mcintosh at LHC.
>>>>              >
>>>>              >
>>>>              >
>>>>
>>>>             _______________________________________________
>>>>             boinc_dev mailing list
>>>>             [email protected]
>>>>             http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>>>             To unsubscribe, visit the above URL and
>>>>             (near bottom of page) enter your email address.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
