>For credit purposes, the standard is peak FLOPS,
>i.e. we give credit for what the device could do,
>rather than what it actually did.
>Among other things, this encourages projects to develop more efficient apps.
It does the opposite because many projects care more about attracting volunteers
than they do about efficient computation.
First: Per second of run time, a host gets the same credit for a non-optimized
stock app as it does for an optimized stock app, so there's no benefit to a
volunteer in joining a project with optimized apps. In fact there's a benefit for
users to compile an optimized app for use at a non-optimized project, where their
credit will be higher. Every time we optimize SETI@home we get bombarded by angry
users of non-stock optimized apps whose RAC has gone down. That creates a
disincentive to optimize.
Second: This method encourages projects to create separate apps for GPUs rather
than separate app_versions. Because GPUs obtain nowhere near their advertised
rates on real code, a separate GPU app can earn 20 to 100x the credit of a GPU
app_version of an app that also has CPU app_versions.
Third: It encourages projects not to use the BOINC credit-granting mechanisms.
To compete with projects that have GPU-only apps, some projects grant outrageous
credit for everything.
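[The incentive problem above can be made concrete with a little arithmetic. A minimal sketch, assuming the standard Cobblestone scale of 200 credits per GFLOPS-day (BOINC's published credit unit); the function name is illustrative, not BOINC's actual code:]

```python
# Sketch of peak-FLOPS crediting: credit depends only on the device's
# peak speed and elapsed time, never on useful work done, so an
# optimized app and an unoptimized app earn identical credit per second.
COBBLESTONES_PER_GFLOPS_DAY = 200.0  # BOINC's published Cobblestone scale

def peak_flops_credit(peak_flops, elapsed_seconds):
    """Credit granted for running any app for elapsed_seconds on a
    device whose benchmarked peak speed is peak_flops."""
    gflops_days = (peak_flops / 1e9) * (elapsed_seconds / 86400.0)
    return gflops_days * COBBLESTONES_PER_GFLOPS_DAY

# A 1-GFLOPS host earns 200 credits for a day of run time, whether the
# app finished one workunit or ten.
```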
On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
For credit purposes, the standard is peak FLOPS,
i.e. we give credit for what the device could do,
rather than what it actually did.
Among other things, this encourages projects to develop more efficient apps.
Currently we're not measuring this well for x86 CPUs,
since our Whetstone benchmark isn't optimized.
Ideally the BOINC client should include variants for the most common
CPU features, as we do for ARM.
-- D
On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
Before anybody leaps into making any changes on the basis of that observation,
I think we ought to pause and consider why we have a benchmark, and what we
use it for.
I'd suggest that in an ideal world, we would be measuring the actual running
speed of (each project's) science applications on that particular host,
optimisations and all. We gradually do this through the runtime averages
anyway, but it's hard to gather a priori data on a new host.
Instead of (initially) measuring science application performance, we measure
hardware performance as a surrogate. We now have (at least) three ways of
doing that:
x86: minimum, most conservative estimate; no optimisations allowed for.
Android: allows for optimised hardware pathways with vfp or neon, but doesn't
relate back to science app capability.
GPU: maximum theoretical 'peak flops', calculated from card parameters, then
scaled back by rule of thumb.
Maybe we should standardise on just one approach?
------------------------------------------------------------------------
*From:* Richard Haselgrove <[email protected]>
*To:* Josef W. Segur <[email protected]>; David Anderson <[email protected]>
*Cc:* "[email protected]" <[email protected]>
*Sent:* Tuesday, 10 June 2014, 9:37
*Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but
please read)
http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>__________________________________
> From: Josef W. Segur <[email protected]>
>To: David Anderson <[email protected]>
>Cc: "[email protected]" <[email protected]>;
>Eric J Korpela <[email protected]>; Richard Haselgrove
><[email protected]>
>Sent: Tuesday, 10 June 2014, 2:19
>Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but
>please read)
>
>
>Consider Richard's observation:
>
>>> It appears that the Android Whetstone benchmark used in the BOINC client
>>> has separate code paths for ARM, vfp, and NEON processors: a vfp or NEON
>>> processor will report that it is significantly faster than a plain-vanilla
>>> ARM.
>
>If that is so, it distinctly differs from the x86 Whetstone, which never uses
SIMD, and is truly conservative, as you would want for 3).
>--
> Joe
>
>
>
>On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
>
>> Eric:
>>
>> Yes, I suspect that's what's going on.
>> Currently the logic for estimating job runtime
>> (estimate_flops() in sched_version.cpp) is:
>> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
>> 2) if this app version has > 100 results, use app version statistics
>> 3) else use a conservative estimate based on p_fpops.
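[The three-step fallback described above can be sketched roughly as follows. This is an illustration of the logic as stated in the email, not the actual C++ in sched_version.cpp; the 0.5 scale factor in step 3 is an assumption standing in for "conservative":]

```python
def estimate_flops(host_av_results, host_av_avg_flops,
                   av_results, av_avg_flops, p_fpops):
    """Tiered projected-FLOPS fallback for a (host, app version) pair.

    host_av_results / host_av_avg_flops: result count and mean effective
        speed from this host's own completed results.
    av_results / av_avg_flops: the same, aggregated over all hosts.
    p_fpops: the host's Whetstone benchmark (the conservative fallback).
    """
    if host_av_results > 10:     # 1) enough host-specific history
        return host_av_avg_flops
    if av_results > 100:         # 2) app-version-wide statistics
        return av_avg_flops
    return 0.5 * p_fpops         # 3) conservative benchmark-based guess
                                 #    (0.5 factor is assumed, not BOINC's)

def runtime_estimate(rsc_fpops_est, projected_flops):
    """Estimated run time in seconds for a workunit."""
    return rsc_fpops_est / projected_flops
```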
>>
>> I'm not sure we should be doing 2) at all,
>> since as you point out the first 100 or 1000 results for an app version
>> will generally be from the fastest devices
>> (and even in the steady state,
>> app version statistics disproportionately reflect fast devices).
>>
>> I'll make this change.
>>
>> -- David
>>
>> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>>> I also don't have direct access to the server, so I'm mostly guessing.
>>> Having separate benchmarks for neon and VFP means there's a broad bimodal
>>> distribution for the benchmark results. Where the mean falls depends upon
>>> the mix of machines. In general the neon machines (being newer and faster)
>>> will report first and more often, so early on the PFC distribution will
>>> reflect the fast machines. Slower machines will be underweighted. So the
>>> work will be estimated to complete quickly, and some machines will time
>>> out. In SETI beta, it resolves itself in a few weeks. I can't guarantee
>>> that it will anywhere else.
>>>
>>> We see this with every release of a GPU app. The real capabilities of
>>> graphics cards vary by orders of magnitude from the estimate, and by more
>>> from each other. The fast cards report first and almost everyone else hits
>>> days of timeouts.
>>>
>>> One possible fix is to increase the timeout limits for the first 10
>>> workunits for a host_app_version, until host-based estimates take over.
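[Eric's proposed mitigation could look roughly like this. FIRST_N follows the ">10 results" host_app_version threshold quoted earlier in the thread; RELAX_FACTOR is an assumed value for illustration, not a BOINC constant:]

```python
FIRST_N = 10          # results before host-based estimates take over
RELAX_FACTOR = 10.0   # extra headroom while estimates are unreliable (assumed)

def effective_fpops_bound(rsc_fpops_bound, completed_results):
    """Relax the abort threshold for a host_app_version's first few
    results, when the server has no host-specific runtime statistics
    and its estimate may be badly off."""
    if completed_results < FIRST_N:
        return rsc_fpops_bound * RELAX_FACTOR
    return rsc_fpops_bound
```

With these assumed values, the SIMAP bound of 40.5 TFLOP would become 405 TFLOP for a new host, turning a ~6-hour abort into a ~60-hour one until real statistics arrive.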
>>>
>>>
>>>
>>>
>>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove
>>> <[email protected]> wrote:
>>>
>>> I think Eric Korpela would be the best person to answer that question, but
>>> I suspect 'probably not': further investigation over the weekend suggests
>>> that the circumstances may be SIMAP-specific.
>>>
>>> It appears that the Android Whetstone benchmark used in the BOINC client
>>> has separate code paths for ARM, vfp, and NEON processors: a vfp or NEON
>>> processor will report that it is significantly faster than a plain-vanilla
>>> ARM.
>>>
>>> However, SIMAP have only deployed a single Android app, which I'm assuming
>>> only uses ARM functions: devices with vfp or NEON SIMD vectorisation
>>> available would run the non-optimised application much slower than BOINC
>>> expects.
>>>
>>> At my suggestion, Thomas Rattei (SIMAP administrator) increased the
>>> rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the
>>> maximum runtime displayed on
>>> http://boincsimap.org/boincsimap/server_status.php has already increased
>>> from 11 hours to 14 hours since he did that.
>>>
>>> Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot.
>>> However, due to Samsung PowerSleep, we thought these are mainly 'lazy'
>>> users just not using their phone regularly for computing." He's going to
>>> monitor how this progresses during the remainder of the current batch, and
>>> I've asked him to keep us updated on his observations.
>>>
>>>
>>>
>>> >__________________________________
>>> > From: David Anderson <[email protected]>
>>> >To: [email protected]
>>> >Sent: Monday, 9 June 2014, 3:48
>>> >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again,
>>> >but please read)
>>> >
>>> >
>>> >Does this problem occur on SETI@home?
>>> >-- David
>>> >
>>> >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>>> >
>>> >> 2) Android runtime estimates
>>> >>
>>> >> The example here is from SIMAP. During a recent pause
between
batches, I noticed
>>> >> that some of my 'pending validation' tasks were being
slow
to clear:
>>> >>
http://boincsimap.org/__boincsimap/results.php?hostid=__349248
<http://boincsimap.org/boincsimap/results.php?hostid=349248>
>>> >>
>>> >> The clearest example is the third of those three
workunits:
>>> >>
http://boincsimap.org/__boincsimap/workunit.php?wuid=__57169928
<http://boincsimap.org/boincsimap/workunit.php?wuid=57169928>
>>> >>
>>> >> Four of the seven replications have failed with 'Error while
>>> >> computing', and every one of those four is an EXIT_TIME_LIMIT_EXCEEDED
>>> >> on an Android device.
>>> >>
>>> >> Three of the four hosts have never returned a valid result (total
>>> >> credit zero), so they have never had a chance to establish an APR for
>>> >> use in runtime estimation: runtime estimates and bounds must have been
>>> >> generated by the server.
>>> >>
>>> >> It seems - from these results, and others I've found pending on other
>>> >> machines - that SIMAP tasks on Android are aborted with
>>> >> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch
>>> >> released today, SIMAP are using a 3x bound (which may be a bit low
>>> >> under the circumstances):
>>> >>
>>> >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>> >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>> >>
>>> >> so I deduce that the tasks when first issued had a runtime estimate of
>>> >> ~2 hours.
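[Richard's deduction checks out arithmetically: the client aborts a task once elapsed work reaches rsc_fpops_bound / projected_flops seconds, so ~6-hour aborts under a 3x bound imply a ~2-hour initial estimate. The 1.875 GFLOPS figure below is back-derived for illustration, not taken from the server:]

```python
rsc_fpops_est = 13.5e12    # from the workunit quoted above, in FLOPs
rsc_fpops_bound = 40.5e12  # the 3x bound

# Speed that would produce a 2-hour initial runtime estimate:
projected_flops = rsc_fpops_est / (2 * 3600)          # 1.875e9 FLOPS

# Abort threshold implied by that speed and the bound:
abort_hours = rsc_fpops_bound / projected_flops / 3600  # 6.0 hours
```

So any Android device actually running below roughly a third of the projected speed, e.g. the 1.69 GFLOPS wingmate mentioned below, lands close to or beyond the abort threshold.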
>>> >>
>>> >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take
>>> >> over half an hour to complete: two hours for an ARM device sounds
>>> >> suspiciously low. The only one of my Android wingmates to have
>>> >> registered an APR
>>> >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033)
>>> >> is showing 1.69 GFLOPS, but I have no way of knowing whether that APR
>>> >> was established before or after the task in question errored out.
>>> >>
>>> >> From experience - borne out by current tests at Albert@Home, where
>>> >> server logs are helpfully exposed to the public - initial server
>>> >> estimates can be hopelessly over-optimistic. These two are for the same
>>> >> machine:
>>> >>
>>> >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716]
>>> >> (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>>> >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716]
>>> >> (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time
>>> >> avg: 23.71G
>>> >>
>>> >> Since SIMAP have recently announced that they are leaving the BOINC
>>> >> platform at the end of the year (despite being an Android launch
>>> >> partner with Samsung), I doubt they'll want to put much effort into
>>> >> researching this issue.
>>> >>
>>> >> But if other projects experimenting with Android applications are
>>> >> experiencing a high task failure rate, they might like to check whether
>>> >> EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and
>>> >> if so, consider the other remediation approaches (apart from outliers,
>>> >> which isn't relevant in this case) that I suggested to Eric Mcintosh at
>>> >> LHC.
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.