The problem with RAC is that it decays exponentially with time. It is just like real life: it may take most of a lifetime to become a VP, and seconds to get fired if you screw up, or if one of your subordinates does it for you. Likewise, if the user's computer has a problem and does not submit any results for a day or so, the slow dribble of pending credits being resolved is not enough to hold the average up, and the RAC drops considerably.
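
To put a rough number on that decay, here is a minimal sketch in Python. It assumes RAC behaves as an exponentially weighted average with a one-week half-life; the half-life value and the function names `decayed_rac` and `update_rac` are my assumptions for illustration, not something taken from the BOINC source.

```python
import math

SECONDS_PER_DAY = 86400.0
# Assumption: RAC has a half-life of one week. Adjust if the real value differs.
HALF_LIFE = 7 * SECONDS_PER_DAY


def decayed_rac(rac, idle_seconds, half_life=HALF_LIFE):
    """RAC after idle_seconds during which no credit at all is granted."""
    return rac * math.exp(-idle_seconds * math.log(2) / half_life)


def update_rac(rac, elapsed_seconds, credit, half_life=HALF_LIFE):
    """Hypothetical update step: decay the old average, then blend in the
    credit rate (credit per day) observed over the elapsed interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    weight = math.exp(-elapsed_seconds * math.log(2) / half_life)
    rate_per_day = credit / (elapsed_seconds / SECONDS_PER_DAY)
    return rac * weight + (1.0 - weight) * rate_per_day


if __name__ == "__main__":
    steady = 1000.0
    # One fully idle day, then one day with only a trickle of pending credit.
    print(decayed_rac(steady, SECONDS_PER_DAY))
    print(update_rac(steady, SECONDS_PER_DAY, 100.0))
```

Under that assumed half-life, one completely idle day costs roughly 9% of the RAC and a full idle week halves it, while climbing back takes weeks of steady work. That asymmetry is the VP effect I mean.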
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of McLeod, John
> Sent: Thursday, June 12, 2014 9:24 AM
> To: Richard Haselgrove; Eric J Korpela; David Anderson
> Cc: [email protected]; Josef W. Segur
> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>
> The reason RAC was not used was because of the delay in granting credit. However, if we remember that RAC is long term, then it makes more sense. If we use RAC as the estimate, then we should also have the client attempt to contact each of the servers for an update of RAC occasionally, but it would not have to be very often - once a week or so would probably suffice. If it had been more than a week since the last connection, it would be time to try again.
>
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of Richard Haselgrove
> Sent: Wednesday, June 11, 2014 4:23 PM
> To: Eric J Korpela; David Anderson
> Cc: [email protected]; Josef W. Segur
> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>
> That one made it as far as the planning document, but no further:
>
> http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen#Proposal:credit-drivenscheduling
>
> The surrogate, REC, is essentially speed * time, or back to square one.
>
> >________________________________
> > From: Eric J Korpela <[email protected]>
> >To: David Anderson <[email protected]>
> >Cc: Richard Haselgrove <[email protected]>; Josef W. Segur <[email protected]>; "[email protected]" <[email protected]>
> >Sent: Wednesday, 11 June 2014, 21:03
> >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
> >
> >Another possibility that came to me years ago would be to use RAC rather than estimated duration to compute the resource allocation on the client side. That way a machine running two projects with equal resource share would end up spending more time running the one with lower granted credit per unit work. That would encourage projects not to over grant (they would lose resources) or under grant (they would lose volunteers).
> >
> >On Tue, Jun 10, 2014 at 1:03 PM, Eric J Korpela <[email protected]> wrote:
> >
> >>I haven't thought about it in a while. I had come up with a stable system that would but it wasn't simple and it also required projects to voluntarily participate. Therefore it wouldn't have worked.
> >>
> >>The only thought I've had recently is to have a "calibration" plan class that has a non-SIMD non-threaded unoptimized CPU-only app_version that gets sent out once out of every N (~100,000) results. This (as the least efficient app_version) could set the pfc_scale. Again, it would require project participation, so it wouldn't work.
> >>
> >>So I spend most of my time trying not to think about it.
> >>
> >>On Tue, Jun 10, 2014 at 12:12 PM, David Anderson <[email protected]> wrote:
> >>
> >>>Are you saying we're taking the wrong approach?
> >>>Any other suggestions?
> >>>
> >>>On 10-Jun-2014 11:51 AM, Eric J Korpela wrote:
> >>>
> >>>> >For credit purposes, the standard is peak FLOPS,
> >>>> >i.e. we give credit for what the device could do,
> >>>> >rather than what it actually did.
> >>>> >Among other things, this encourages projects to develop more efficient apps.
> >>>>
> >>>>It does the opposite because many projects care more about attracting volunteers than they do about efficient computation.
> >>>>
> >>>>First: Per second of run time, a host gets the same credit for a non-optimized stock app as it does for an optimized stock app. There's no benefit to the volunteer to go to a project with optimized apps. In fact there's a benefit for users to compile an optimized app for use at a non-optimized project where their credit will be higher. Every time we optimize SETI@home we get bombarded by users of non-stock optimized apps who get angry because their RAC goes down. That makes it a disincentive to optimize.
> >>>>
> >>>>Second: This method encourages projects to create separate apps for GPUs rather than separate app_versions. Because GPUs obtain nowhere near their advertised rates for real code, a separate GPU app can earn 20 to 100x the credit of a GPU app_version of an app that also has CPU app_versions.
> >>>>
> >>>>Third: It encourages projects to not use the BOINC credit granting mechanisms. To compete with projects that have GPU only apps, some projects grant outrageous credit for everything.
> >>>>
> >>>>On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
> >>>>
> >>>>    For credit purposes, the standard is peak FLOPS,
> >>>>    i.e. we give credit for what the device could do,
> >>>>    rather than what it actually did.
> >>>>    Among other things, this encourages projects to develop more efficient apps.
> >>>>
> >>>>    Currently we're not measuring this well for x86 CPUs,
> >>>>    since our Whetstone benchmark isn't optimized.
> >>>>    Ideally the BOINC client should include variants for the most common
> >>>>    CPU features, as we do for ARM.
> >>>>
> >>>>    -- D
> >>>>
> >>>>    On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
> >>>>
> >>>>        Before anybody leaps into making any changes on the basis of that observation, I think we ought to pause and consider why we have a benchmark, and what we use it for.
> >>>>
> >>>>        I'd suggest that in an ideal world, we would be measuring the actual running speed of (each project's) science applications on that particular host, optimisations and all. We gradually do this through the runtime averages anyway, but it's hard to gather a priori data on a new host.
> >>>>
> >>>>        Instead of (initially) measuring science application performance, we measure hardware performance as a surrogate. We now have (at least) three ways of doing that:
> >>>>
> >>>>        x86: minimum, most conservative, estimate, no optimisations allowed for.
> >>>>        Android: allows for optimised hardware pathways with vfp or neon, but doesn't relate back to science app capability.
> >>>>        GPU: maximum theoretical 'peak flops', calculated from card parameters, then scaled back by rule of thumb.
> >>>>
> >>>>        Maybe we should standardise on just one standard?
> >>>>
> >>>>        ------------------------------------------------------------
> >>>>        *From:* Richard Haselgrove <[email protected]>
> >>>>        *To:* Josef W. Segur <[email protected]>; David Anderson <[email protected]>
> >>>>        *Cc:* "[email protected]" <[email protected]>
> >>>>        *Sent:* Tuesday, 10 June 2014, 9:37
> >>>>        *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
> >>>>
> >>>>        http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
> >>>>
> >>>>        >__________________________________
> >>>>        > From: Josef W. Segur <[email protected]>
> >>>>        >To: David Anderson <[email protected]>
> >>>>        >Cc: "[email protected]" <[email protected]>; Eric J Korpela <[email protected]>; Richard Haselgrove <[email protected]>
> >>>>        >Sent: Tuesday, 10 June 2014, 2:19
> >>>>        >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
> >>>>        >
> >>>>        >Consider Richard's observation:
> >>>>        >
> >>>>        >>> It appears that the Android Whetstone benchmark used in the BOINC client has separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor will report that it is significantly faster than a plain-vanilla ARM.
> >>>>        >
> >>>>        >If that is so, it distinctly differs from the x86 Whetstone which never uses SIMD, and is truly conservative as you would want for 3).
> >>>>        >--
> >>>>        > Joe
> >>>>        >
> >>>>        >On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
> >>>>        >
> >>>>        >> Eric:
> >>>>        >>
> >>>>        >> Yes, I suspect that's what's going on.
> >>>>        >> Currently the logic for estimating job runtime
> >>>>        >> (estimate_flops() in sched_version.cpp) is
> >>>>        >> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
> >>>>        >> 2) if this app version has > 100 results, use app version statistics
> >>>>        >> 3) else use a conservative estimate based on p_fpops.
> >>>>        >>
> >>>>        >> I'm not sure we should be doing 2) at all,
> >>>>        >> since as you point out the first x100 or 1000 results for an app version
> >>>>        >> will generally be from the fastest devices
> >>>>        >> (and even in the steady state,
> >>>>        >> app version statistics disproportionately reflect fast devices).
> >>>>        >>
> >>>>        >> I'll make this change.
> >>>>        >>
> >>>>        >> -- David
> >>>>        >>
> >>>>        >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
> >>>>        >>> I also don't have direct access to the server as well, so I'm mostly guessing.
> >>>>        >>> Having separate benchmarks for neon and VFP means there's a broad bimodal distribution for the benchmark results. Where the mean falls depends upon the mix of machines. In general the neon machines (being newer and faster) will report first and more often, so early on the PFC distribution will reflect the fast machines. Slower machines will be underweighted. So the work will be estimated to complete quickly, and some machines will time out. In SETI beta, it resolves itself in a few weeks. I can't guarantee that it will anywhere else.
> >>>>        >>>
> >>>>        >>> We see this with every release of a GPU app. The real capabilities of graphics cards vary by orders of magnitude from the estimate and by more from each other. The fast cards report first and most everyone else hits days of timeouts.
> >>>>        >>>
> >>>>        >>> One possible fix is to increase the timeout limits for the first 10 workunits for a host_app_version, until host based estimates take over.
> >>>>        >>>
> >>>>        >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
> >>>>        >>>
> >>>>        >>>     I think Eric Korpela would be the best person to answer that question, but I suspect 'probably not': further investigation over the weekend suggests that the circumstances may be SIMAP-specific.
> >>>>        >>>
> >>>>        >>>     It appears that the Android Whetstone benchmark used in the BOINC client has separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor will report that it is significantly faster than a plain-vanilla ARM.
> >>>>        >>>
> >>>>        >>>     However, SIMAP have only deployed a single Android app, which I'm assuming only uses ARM functions: devices with vfp or NEON SIMD vectorisation available would run the non-optimised application much slower than BOINC expects.
> >>>>        >>>
> >>>>        >>>     At my suggestion, Thomas Rattei (SIMAP administrator) increased the rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum runtime displayed on http://boincsimap.org/boincsimap/server_status.php has already increased from 11 hours to 14 hours since he did that.
> >>>>        >>>
> >>>>        >>>     Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However, due to Samsung PowerSleep, we thought these are mainly "lazy" users just not using their phone regularly for computing." He's going to monitor how this progresses during the remainder of the current batch, and I've asked him to keep us updated on his observations.
> >>>>        >>>
> >>>>        >>>     >__________________________________
> >>>>        >>>     > From: David Anderson <[email protected]>
> >>>>        >>>     >To: [email protected]
> >>>>        >>>     >Sent: Monday, 9 June 2014, 3:48
> >>>>        >>>     >Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
> >>>>        >>>     >
> >>>>        >>>     >Does this problem occur on SETI@home?
> >>>>        >>>     >-- David
> >>>>        >>>     >
> >>>>        >>>     >On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
> >>>>        >>>     >
> >>>>        >>>     >> 2) Android runtime estimates
> >>>>        >>>     >>
> >>>>        >>>     >> The example here is from SIMAP. During a recent pause between batches, I noticed that some of my 'pending validation' tasks were being slow to clear:
> >>>>        >>>     >> http://boincsimap.org/boincsimap/results.php?hostid=349248
> >>>>        >>>     >>
> >>>>        >>>     >> The clearest example is the third of those three workunits:
> >>>>        >>>     >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
> >>>>        >>>     >>
> >>>>        >>>     >> Four of the seven replications have failed with 'Error while computing', and every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
> >>>>        >>>     >>
> >>>>        >>>     >> Three of the four hosts have never returned a valid result (total credit zero), so they have never had a chance to establish an APR for use in runtime estimation: runtime estimates and bounds must have been generated by the server.
> >>>>        >>>     >>
> >>>>        >>>     >> It seems - from these results, and others I've found pending on other machines - that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch released today, SIMAP are using a 3x bound (which may be a bit low under the circumstances):
> >>>>        >>>     >>
> >>>>        >>>     >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
> >>>>        >>>     >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
> >>>>        >>>     >>
> >>>>        >>>     >> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
> >>>>        >>>     >>
> >>>>        >>>     >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over half an hour to complete: two hours for an ARM device sounds suspiciously low. The only one of my Android wingmates to have registered an APR (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is showing 1.69 GFLOPS, but I have no way of knowing whether that APR was established before or after the task in question errored out.
> >>>>        >>>     >>
> >>>>        >>>     >> From experience - borne out by current tests at Albert@Home, where server logs are helpfully exposed to the public - initial server estimates can be hopelessly over-optimistic. These two are for the same machine:
> >>>>        >>>     >>
> >>>>        >>>     >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
> >>>>        >>>     >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
> >>>>        >>>     >>
> >>>>        >>>     >> Since SIMAP have recently announced that they are leaving the BOINC platform at the end of the year (despite being an Android launch partner with Samsung), I doubt they'll want to put much effort into researching this issue.
> >>>>        >>>     >>
> >>>>        >>>     >> But if other projects experimenting with Android applications are experiencing a high task failure rate, they might like to check whether EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if so, consider the other remediation approaches (apart from outliers, which isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
