Wasn't the fundamental problem being attacked the constant credit inflation due to architectural improvements in CPUs and GPUs? It is like monetary inflation: "credits in the bank", i.e., in the database, become worth less due to factors people cannot control. I don't know of any way of countering this except by reducing the credit allocated per FLOP.
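A minimal sketch of the arithmetic behind this point. It assumes the usual Cobblestone scale of 200 credits per day for a device sustaining 1 GFLOPS and ignores BOINC's actual normalisation steps. With a fixed credit-per-FLOP rate, per-day credit scales directly with device speed. The `deflated_rate` line shows one hypothetical way to "reduce the credit allocated per FLOP" so that a newer device earns what an older one did; the device speeds are invented for illustration.

```python
SECONDS_PER_DAY = 86_400

# Assumed Cobblestone scale: 200 credits per day of work at 1 GFLOPS.
CREDIT_PER_FLOP = 200 / (SECONDS_PER_DAY * 1e9)


def credit_per_day(device_flops: float, credit_per_flop: float = CREDIT_PER_FLOP) -> float:
    """Credit earned by one day of work at a fixed credit-per-FLOP rate."""
    return device_flops * SECONDS_PER_DAY * credit_per_flop


# Hypothetical sustained speeds (FLOPS) for an older and a newer device.
old_device = 1e12
new_device = 4e12

# Same rate, faster hardware: four times the credit per day ("inflation").
print(f"{credit_per_day(old_device):.0f}")  # 200000
print(f"{credit_per_day(new_device):.0f}")  # 800000

# Hypothetical deflation: scale the per-FLOP rate by the speed ratio so the
# newer device earns what the older one did.
deflated_rate = CREDIT_PER_FLOP * old_device / new_device
print(f"{credit_per_day(new_device, deflated_rate):.0f}")  # 200000
```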
Charles Elliott

> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of Raistmer the Sorcerer
> Sent: Tuesday, June 10, 2014 3:52 PM
> To: David Anderson
> Cc: [email protected]; Richard Haselgrove; Josef W. Segur
> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>
> The current approach to credit accounting is definitely wrong. The SETI forums have been discussing how wrong it is for many months already; it's almost impossible to avoid the topic if you ever go there.
>
> Some suggestions:
> Recall why BOINC needs credits at all. The correct answer is to ATTRACT participants by exploiting HUMAN competitive nature, not to measure anything. It's social engineering first of all!
>
> Some conclusions follow from this.
> It's human nature to get angry at being "paid less". Hence: NEVER deflate credits! Inflation is no problem, people like to get more, but NEVER decrease the amount granted for any reason.
>
> And that's exactly what we get with the current system.
> We work hard to optimize the SETI code. Then we release the app. All users who install it are happy: it runs faster, and they get MORE credit with the SAME hardware.
> Then, being interested in the project, we try to incorporate the optimizations we found into the project's stock app. Finally the new stock app is released... and the whole mess begins. Users of the stock app notice nothing; their credit stays the same. But THE MOST active users, the ones who go to the trouble of installing optimized apps, switching to anonymous platform and so on (let's say the project's biggest fans), instantly get pissed off. Their RAC starts to drop! WHY?! Because some "idiots" decided to improve the stock app??! And the flame wars on the forums begin.
> None of this is about how scientifically correctly you guys account for the FLOPS being done; it's about keeping the PARTICIPANTS who donate resources HAPPY.
> And the current CreditScrew produces exactly the opposite of the intended feelings in both participants AND developers.
> One would say quite an impressive outcome...
>
> What I'd suggest for further discussion: try to calibrate not on the stock app but on the fastest correctly computing app in the project (usually this will mean an anonymous platform app, btw).
> Then, even if some credits are decreased (though additional considerations would be needed to avoid ANY drop in RAC caused by a software replacement in stock), they would be decreased for stock users.
> This would
> 1) stimulate users to install the fastest app;
> 2) stimulate the project to incorporate the fastest algorithms in its stock app.
>
> Tue, 10 Jun 2014 12:12:24 -0700 from David Anderson <[email protected]>:
>> Are you saying we're taking the wrong approach?
>> Any other suggestions?
>>
>> On 10-Jun-2014 11:51 AM, Eric J Korpela wrote:
>>>> For credit purposes, the standard is peak FLOPS,
>>>> i.e. we give credit for what the device could do,
>>>> rather than what it actually did.
>>>> Among other things, this encourages projects to develop more efficient apps.
>>>
>>> It does the opposite, because many projects care more about attracting volunteers than they do about efficient computation.
>>>
>>> First: per second of run time, a host gets the same credit for a non-optimized stock app as it does for an optimized stock app. There's no benefit to the volunteer in going to a project with optimized apps. In fact there's a benefit for users to compile an optimized app for use at a non-optimized project, where their credit will be higher. Every time we optimize SETI@home we get bombarded by users of non-stock optimized apps who are angry because their RAC goes down. That makes it a disincentive to optimize.
>>>
>>> Second: this method encourages projects to create separate apps for GPUs rather than separate app_versions.
>>> Because GPUs obtain nowhere near their advertised rates for real code, a separate GPU app can earn 20 to 100x the credit of a GPU app_version of an app that also has CPU app_versions.
>>>
>>> Third: it encourages projects not to use the BOINC credit-granting mechanisms. To compete with projects that have GPU-only apps, some projects grant outrageous credit for everything.
>>>
>>> On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
>>>> For credit purposes, the standard is peak FLOPS,
>>>> i.e. we give credit for what the device could do,
>>>> rather than what it actually did.
>>>> Among other things, this encourages projects to develop more efficient apps.
>>>>
>>>> Currently we're not measuring this well for x86 CPUs,
>>>> since our Whetstone benchmark isn't optimized.
>>>> Ideally the BOINC client should include variants for the most common
>>>> CPU features, as we do for ARM.
>>>>
>>>> -- D
>>>>
>>>> On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
>>>>> Before anybody leaps into making any changes on the basis of that observation, I think we ought to pause and consider why we have a benchmark, and what we use it for.
>>>>>
>>>>> I'd suggest that in an ideal world, we would be measuring the actual running speed of (each project's) science applications on that particular host, optimisations and all. We gradually do this through the runtime averages anyway, but it's hard to gather a priori data on a new host.
>>>>>
>>>>> Instead of (initially) measuring science application performance, we measure hardware performance as a surrogate. We now have (at least) three ways of doing that:
>>>>>
>>>>> x86: minimum, most conservative estimate, no optimisations allowed for.
>>>>> Android: allows for optimised hardware pathways with vfp or neon, but doesn't relate back to science app capability.
>>>>> GPU: maximum theoretical 'peak flops', calculated from card parameters, then scaled back by rule of thumb.
>>>>>
>>>>> Maybe we should standardise on just one standard?
>>>>>
>>>>>> ------------------------------------------------------------
>>>>>> From: Richard Haselgrove <[email protected]>
>>>>>> To: Josef W. Segur <[email protected]>; David Anderson <[email protected]>
>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>> Sent: Tuesday, 10 June 2014, 9:37
>>>>>> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>>>>>
>>>>>> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>>>>>>
>>>>>>> __________________________________
>>>>>>> From: Josef W. Segur <[email protected]>
>>>>>>> To: David Anderson <[email protected]>
>>>>>>> Cc: "[email protected]" <[email protected]>; Eric J Korpela <[email protected]>; Richard Haselgrove <[email protected]>
>>>>>>> Sent: Tuesday, 10 June 2014, 2:19
>>>>>>> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>>>>>>
>>>>>>> Consider Richard's observation:
>>>>>>>
>>>>>>>>> It appears that the Android Whetstone benchmark used in the BOINC client has separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor will report that it is significantly faster than a plain-vanilla ARM.
>>>>>>>
>>>>>>> If that is so, it distinctly differs from the x86 Whetstone, which never uses SIMD and is truly conservative, as you would want for 3).
>>>>>>> --
>>>>>>> Joe
>>>>>>>
>>>>>>> On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
>>>>>>>> Eric:
>>>>>>>>
>>>>>>>> Yes, I suspect that's what's going on.
>>>>>>>> Currently the logic for estimating job runtime
>>>>>>>> (estimate_flops() in sched_version.cpp) is
>>>>>>>> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
>>>>>>>> 2) if this app version has > 100 results, use app version statistics
>>>>>>>> 3) else use a conservative estimate based on p_fpops.
>>>>>>>>
>>>>>>>> I'm not sure we should be doing 2) at all,
>>>>>>>> since as you point out the first 100 or 1000 results for an app version
>>>>>>>> will generally be from the fastest devices
>>>>>>>> (and even in the steady state,
>>>>>>>> app version statistics disproportionately reflect fast devices).
>>>>>>>>
>>>>>>>> I'll make this change.
>>>>>>>>
>>>>>>>> -- David
>>>>>>>>
>>>>>>>> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>>>>>>>>> I don't have direct access to the server either, so I'm mostly guessing.
>>>>>>>>> Having separate benchmarks for NEON and VFP means there's a broad bimodal distribution of benchmark results. Where the mean falls depends upon the mix of machines. In general the NEON machines (being newer and faster) will report first and more often, so early on the PFC distribution will reflect the fast machines. Slower machines will be underweighted. So the work will be estimated to complete quickly, and some machines will time out. In SETI beta, it resolves itself in a few weeks. I can't guarantee that it will anywhere else.
>>>>>>>>>
>>>>>>>>> We see this with every release of a GPU app. The real capabilities of graphics cards vary by orders of magnitude from the estimate, and by more from each other. The fast cards report first and almost everyone else hits days of timeouts.
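Eric's explanation, together with the three-tier fallback David describes above, can be made concrete with a toy model. This is a hypothetical Python sketch, not the logic in sched_version.cpp: `RunStats`, `estimate_flops_sketch` and `times_out` are invented names, tier 3 uses the benchmark value unadjusted (the thread only says the real fallback is "a conservative estimate based on p_fpops"), and the time limit is simplified to rsc_fpops_bound divided by the projected FLOPS. The speeds are made up: an app version whose first 150 results came mostly from fast NEON devices averaging 2 GFLOPS, and a new plain-ARM host that really sustains 0.3 GFLOPS. The job size and 3x bound are the SIMAP figures quoted further down.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    n_results: int     # completed results behind the average
    avg_flops: float   # observed speed: FLOPs done / elapsed seconds


def estimate_flops_sketch(host_av: RunStats, app_version: RunStats,
                          p_fpops: float, use_app_version_stats: bool = True) -> float:
    """Hypothetical model of the three-tier fallback described in the thread."""
    if host_av.n_results > 10:
        return host_av.avg_flops        # 1) (host, app version) statistics
    if use_app_version_stats and app_version.n_results > 100:
        return app_version.avg_flops    # 2) app-version-wide statistics
    return p_fpops                      # 3) benchmark fallback (unadjusted here)


def times_out(actual_flops: float, projected_flops: float,
              fpops_est: float, bound_factor: float) -> bool:
    """True if the real runtime exceeds the simplified time limit."""
    time_limit = fpops_est * bound_factor / projected_flops  # rsc_fpops_bound / projected
    actual_runtime = fpops_est / actual_flops
    return actual_runtime > time_limit


new_host = RunStats(n_results=0, avg_flops=0.0)                # no results yet
early_app_version = RunStats(n_results=150, avg_flops=2.0e9)  # skewed by fast devices
slow_speed = 0.3e9  # plain-ARM host; assumed equal to its benchmark here

with_tier2 = estimate_flops_sketch(new_host, early_app_version, p_fpops=slow_speed)
without_tier2 = estimate_flops_sketch(new_host, early_app_version, p_fpops=slow_speed,
                                      use_app_version_stats=False)

print(times_out(slow_speed, with_tier2, 13.5e12, 3))     # True: 45000 s vs 20250 s limit
print(times_out(slow_speed, without_tier2, 13.5e12, 3))  # False: 45000 s vs 135000 s limit
```

Removing tier 2, as David proposes, leaves a new host on its own benchmark until its own statistics accumulate; in this toy model that alone is enough to avoid the time-out.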
>>>>>>>>>
>>>>>>>>> One possible fix is to increase the timeout limits for the first 10 workunits for a host_app_version, until host-based estimates take over.
>>>>>>>>>
>>>>>>>>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
>>>>>>>>>> I think Eric Korpela would be the best person to answer that question, but I suspect 'probably not': further investigation over the weekend suggests that the circumstances may be SIMAP-specific.
>>>>>>>>>>
>>>>>>>>>> It appears that the Android Whetstone benchmark used in the BOINC client has separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor will report that it is significantly faster than a plain-vanilla ARM.
>>>>>>>>>>
>>>>>>>>>> However, SIMAP have only deployed a single Android app, which I'm assuming only uses ARM functions: devices with vfp or NEON SIMD vectorisation available would run the non-optimised application much slower than BOINC expects.
>>>>>>>>>>
>>>>>>>>>> At my suggestion, Thomas Rattei (SIMAP administrator) increased the rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum runtime displayed on http://boincsimap.org/boincsimap/server_status.php has already increased from 11 hours to 14 hours since he did that.
>>>>>>>>>>
>>>>>>>>>> Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot.
>>>>>>>>>> However, due to Samsung PowerSleep, we thought these are mainly "lazy" users just not using their phone regularly for computing." He's going to monitor how this progresses during the remainder of the current batch, and I've asked him to keep us updated on his observations.
>>>>>>>>>>
>>>>>>>>>>> __________________________________
>>>>>>>>>>> From: David Anderson <[email protected]>
>>>>>>>>>>> To: [email protected]
>>>>>>>>>>> Sent: Monday, 9 June 2014, 3:48
>>>>>>>>>>> Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>>>>>>>>>>
>>>>>>>>>>> Does this problem occur on SETI@home?
>>>>>>>>>>> -- David
>>>>>>>>>>>
>>>>>>>>>>> On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>>>>>>>>>>>> 2) Android runtime estimates
>>>>>>>>>>>>
>>>>>>>>>>>> The example here is from SIMAP.
>>>>>>>>>>>> During a recent pause between batches, I noticed that some of my 'pending validation' tasks were being slow to clear:
>>>>>>>>>>>> http://boincsimap.org/boincsimap/results.php?hostid=349248
>>>>>>>>>>>>
>>>>>>>>>>>> The clearest example is the third of those three workunits:
>>>>>>>>>>>> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>>>>>>>>>>>>
>>>>>>>>>>>> Four of the seven replications have failed with 'Error while computing', and every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>>>>>>>>>>>>
>>>>>>>>>>>> Three of the four hosts have never returned a valid result (total credit zero), so they have never had a chance to establish an APR for use in runtime estimation: runtime estimates and bounds must have been generated by the server.
>>>>>>>>>>>>
>>>>>>>>>>>> It seems, from these results and others I've found pending on other machines, that SIMAP tasks on Android are aborted with EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch released today, SIMAP are using a 3x bound (which may be a bit low under the circumstances):
>>>>>>>>>>>>
>>>>>>>>>>>> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>>>>>>>>>>>> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>>>>>>>>>>>>
>>>>>>>>>>>> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
>>>>>>>>>>>>
>>>>>>>>>>>> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over half an hour to complete: two hours for an ARM device sounds suspiciously low.
>>>>>>>>>>>> The only one of my Android wingmates to have registered an APR (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is showing 1.69 GFLOPS, but I have no way of knowing whether that APR was established before or after the task in question errored out.
>>>>>>>>>>>>
>>>>>>>>>>>> From experience, borne out by current tests at Albert@Home (where server logs are helpfully exposed to the public), initial server estimates can be hopelessly over-optimistic. These two are for the same machine:
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301) adjusting projected flops based on PFC avg: 2124.60G
>>>>>>>>>>>> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301) setting projected flops based on host elapsed time avg: 23.71G
>>>>>>>>>>>>
>>>>>>>>>>>> Since SIMAP have recently announced that they are leaving the BOINC platform at the end of the year (despite being an Android launch partner with Samsung), I doubt they'll want to put much effort into researching this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> But if other projects experimenting with Android applications are experiencing a high task failure rate, they might like to check whether EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if so, consider the other remediation approaches (apart from outliers, which isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.
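Richard's deduction can be checked with a few lines of Python. The sketch assumes, as his reasoning does, that the client's time limit is rsc_fpops_bound divided by the projected FLOPS the server assigned, and it treats "~6 hours" as exactly 6 hours. All figures are the ones quoted in his message and the surrounding thread: the SIMAP workunit values, the ~6-hour abort, the 10x multiplier, and the two Albert@Home log lines.

```python
SECONDS_PER_HOUR = 3600.0

rsc_fpops_est = 13_500_000_000_000.0    # SIMAP workunit estimate
rsc_fpops_bound = 40_500_000_000_000.0  # 3x bound used for the new batch
observed_limit_hours = 6.0              # tasks aborted after ~6 hours

# If limit = rsc_fpops_bound / projected_flops, the projected speed follows.
projected_flops = rsc_fpops_bound / (observed_limit_hours * SECONDS_PER_HOUR)
initial_estimate_hours = rsc_fpops_est / projected_flops / SECONDS_PER_HOUR
print(f"projected speed: {projected_flops / 1e9:.3f} GFLOPS")       # 1.875
print(f"initial runtime estimate: {initial_estimate_hours:.1f} h")  # 2.0

# Same projected speed, with the bound raised to 10x the estimate.
limit_10x_hours = 10 * rsc_fpops_est / projected_flops / SECONDS_PER_HOUR
print(f"limit with a 10x bound: {limit_10x_hours:.1f} h")  # 20.0

# Albert@Home: PFC-based projection vs. the later host-based projection.
pfc_gflops, host_gflops = 2124.60, 23.71
print(f"over-optimism factor: {pfc_gflops / host_gflops:.1f}x")  # 89.6x
```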
>>>>> _______________________________________________
>>>>> boinc_dev mailing list
>>>>> [email protected]
>>>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>>>> To unsubscribe, visit the above URL and
>>>>> (near bottom of page) enter your email address.
