I haven't thought about it in a while. I had come up with a stable system that would work, but it wasn't simple and it also required projects to voluntarily participate. Therefore it wouldn't have worked.

The only thought I've had recently is to have a "calibration" plan class with a non-SIMD, non-threaded, unoptimized, CPU-only app_version that gets sent out once in every N (~100,000) results. This (as the least efficient app_version) could set the pfc_scale. Again, it would require project participation, so it wouldn't work. So I spend most of my time trying not to think about it.
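(A minimal sketch of what that sampling decision might look like, purely for illustration: the constant, the function, and the result counter are hypothetical names, and the actual pfc_scale normalization would still be done by the existing CreditNew machinery, with the calibration version's PFC statistics serving as the reference.)

    // Hypothetical sketch only -- none of these identifiers exist in the
    // BOINC scheduler. The idea: roughly 1 in every N results is issued
    // with the unoptimized "calibration" app_version, and its PFC
    // statistics then anchor the pfc_scale of the other app_versions.

    const long CALIBRATION_INTERVAL = 100000;   // ~1 in 100,000 results

    // Decide whether this result should be sent with the calibration
    // (non-SIMD, non-threaded, CPU-only) app_version.
    bool use_calibration_version(long result_seqno) {
        return (result_seqno % CALIBRATION_INTERVAL) == 0;
    }

The hard part, as noted above, is that it only works if projects actually build and maintain such a version.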
On Tue, Jun 10, 2014 at 12:12 PM, David Anderson <[email protected]> wrote:

> Are you saying we're taking the wrong approach?
> Any other suggestions?
>
> On 10-Jun-2014 11:51 AM, Eric J Korpela wrote:
>
>> > For credit purposes, the standard is peak FLOPS,
>> > i.e. we give credit for what the device could do,
>> > rather than what it actually did.
>> > Among other things, this encourages projects to develop more efficient apps.
>>
>> It does the opposite, because many projects care more about attracting
>> volunteers than they do about efficient computation.
>>
>> First: per second of run time, a host gets the same credit for a non-optimized
>> stock app as it does for an optimized stock app. There's no benefit to the
>> volunteer in going to a project with optimized apps. In fact there's a benefit
>> for users to compile an optimized app for use at a non-optimized project, where
>> their credit will be higher. Every time we optimize SETI@home we get bombarded
>> by users of non-stock optimized apps who are angry because their RAC goes down.
>> That makes it a disincentive to optimize.
>>
>> Second: this method encourages projects to create separate apps for GPUs rather
>> than separate app_versions. Because GPUs obtain nowhere near their advertised
>> rates for real code, a separate GPU app can earn 20 to 100x the credit of a GPU
>> app_version of an app that also has CPU app_versions.
>>
>> Third: it encourages projects not to use the BOINC credit granting mechanisms.
>> To compete with projects that have GPU-only apps, some projects grant
>> outrageous credit for everything.
>>
>> On Tue, Jun 10, 2014 at 11:34 AM, David Anderson <[email protected]> wrote:
>>
>> For credit purposes, the standard is peak FLOPS,
>> i.e. we give credit for what the device could do,
>> rather than what it actually did.
>> Among other things, this encourages projects to develop more efficient apps.
>>
>> Currently we're not measuring this well for x86 CPUs,
>> since our Whetstone benchmark isn't optimized.
>> Ideally the BOINC client should include variants for the most common
>> CPU features, as we do for ARM.
>>
>> -- D
>>
>> On 10-Jun-2014 2:09 AM, Richard Haselgrove wrote:
>>
>> Before anybody leaps into making any changes on the basis of that observation,
>> I think we ought to pause and consider why we have a benchmark, and what we
>> use it for.
>>
>> I'd suggest that in an ideal world, we would be measuring the actual running
>> speed of (each project's) science applications on that particular host,
>> optimisations and all. We gradually do this through the runtime averages
>> anyway, but it's hard to gather a priori data on a new host.
>>
>> Instead of (initially) measuring science application performance, we measure
>> hardware performance as a surrogate. We now have (at least) three ways of
>> doing that:
>>
>> x86: minimum, most conservative estimate; no optimisations allowed for.
>> Android: allows for optimised hardware pathways with vfp or NEON, but
>> doesn't relate back to science app capability.
>> GPU: maximum theoretical 'peak flops', calculated from card parameters, then
>> scaled back by rule of thumb.
>>
>> Maybe we should standardise on just one?
>>
>> ________________________________________________________________________
>> *From:* Richard Haselgrove <[email protected]>
>> *To:* Josef W. Segur <[email protected]>; David Anderson <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>
>> *Sent:* Tuesday, 10 June 2014, 9:37
>> *Subject:* Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>>
>> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=7b2ca9e787a204f2a57f390bc7249bb7f9997fea
>>
>> ________________________________
>> > From: Josef W. Segur <[email protected]>
>> > To: David Anderson <[email protected]>
>> > Cc: "[email protected]" <[email protected]>; Eric J Korpela
>> > <[email protected]>; Richard Haselgrove <[email protected]>
>> > Sent: Tuesday, 10 June 2014, 2:19
>> > Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>> >
>> > Consider Richard's observation:
>> >
>> >>> It appears that the Android Whetstone benchmark used in the BOINC client has
>> >>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>> >>> will report that it is significantly faster than a plain-vanilla ARM.
>> >
>> > If that is so, it distinctly differs from the x86 Whetstone, which never uses
>> > SIMD and is truly conservative, as you would want for 3).
>> > --
>> > Joe
>> >
>> > On Mon, 09 Jun 2014 16:43:17 -0400, David Anderson <[email protected]> wrote:
>> >
>> >> Eric:
>> >>
>> >> Yes, I suspect that's what's going on.
>> >> Currently the logic for estimating job runtime
>> >> (estimate_flops() in sched_version.cpp) is
>> >> 1) if this (host, app version) has > 10 results, use (host, app version) statistics
>> >> 2) if this app version has > 100 results, use app version statistics
>> >> 3) else use a conservative estimate based on p_fpops.
>> >>
>> >> I'm not sure we should be doing 2) at all,
>> >> since as you point out the first 100 or 1000 results for an app version
>> >> will generally be from the fastest devices
>> >> (and even in the steady state,
>> >> app version statistics disproportionately reflect fast devices).
>> >>
>> >> I'll make this change.
>> >>
>> >> -- David
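(To make those three cases concrete, here is a compilable paraphrase of the selection order; the struct fields and thresholds are illustrative stand-ins, not the actual identifiers in sched_version.cpp, and case 2 is the one proposed for removal above.)

    // Illustrative stand-ins for the scheduler's records; the field names
    // are hypothetical, not the real ones in sched_version.cpp.
    struct HOST             { double p_fpops; };    // benchmarked FLOPS
    struct APP_VERSION      { int pfc_n; double projected_flops; };
    struct HOST_APP_VERSION {
        int et_n;        // completed results for this (host, app version)
        double et_avg;   // average elapsed seconds per estimated FLOP
    };

    const int HAV_MIN_RESULTS = 10;
    const int AV_MIN_RESULTS  = 100;

    // Paraphrase of the three-tier logic described above.
    double estimate_flops_sketch(const HOST_APP_VERSION& hav,
                                 const APP_VERSION& av,
                                 const HOST& host) {
        if (hav.et_n > HAV_MIN_RESULTS) {
            return 1.0 / hav.et_avg;     // 1) per-(host, app version) statistics
        }
        if (av.pfc_n > AV_MIN_RESULTS) {
            return av.projected_flops;   // 2) app-version statistics (biased toward fast hosts)
        }
        return host.p_fpops;             // 3) conservative estimate from the host benchmark
    }

Dropping case 2 would mean a new host stays on the conservative benchmark-based estimate until it has accumulated its own history.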
>> >> On 09-Jun-2014 8:10 AM, Eric J Korpela wrote:
>> >>> I also don't have direct access to the server, so I'm mostly guessing.
>> >>> Having separate benchmarks for NEON and VFP means there's a broad bimodal
>> >>> distribution for the benchmark results. Where the mean falls depends upon the
>> >>> mix of machines. In general the NEON machines (being newer and faster) will
>> >>> report first and more often, so early on the PFC distribution will reflect the
>> >>> fast machines. Slower machines will be underweighted. So the work will be
>> >>> estimated to complete quickly, and some machines will time out. In SETI beta,
>> >>> it resolves itself in a few weeks. I can't guarantee that it will anywhere else.
>> >>>
>> >>> We see this with every release of a GPU app. The real capabilities of graphics
>> >>> cards vary by orders of magnitude from the estimate and by more from each
>> >>> other. The fast cards report first, and most everyone else hits days of timeouts.
>> >>>
>> >>> One possible fix is to increase the timeout limits for the first 10 workunits
>> >>> for a host_app_version, until host-based estimates take over.
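(A sketch of what that mitigation could look like on the server side, purely illustrative: the names, the threshold of 10, and the extra factor are assumptions, not anything in the scheduler.)

    // Hypothetical sketch of the suggestion above: give extra slack on
    // rsc_fpops_bound while a host_app_version has too little history
    // for host-based runtime estimates. All names and numbers here are
    // illustrative, not actual BOINC scheduler code.

    const int    HAV_RELIABLE_RESULTS = 10;   // results before host estimates take over
    const double EARLY_BOUND_FACTOR   = 10.0; // assumed extra slack for early results

    double effective_fpops_bound(double rsc_fpops_bound, int hav_completed_results) {
        if (hav_completed_results < HAV_RELIABLE_RESULTS) {
            return rsc_fpops_bound * EARLY_BOUND_FACTOR;
        }
        return rsc_fpops_bound;
    }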
>> >>> On Mon, Jun 9, 2014 at 2:02 AM, Richard Haselgrove <[email protected]> wrote:
>> >>>
>> >>> I think Eric Korpela would be the best person to answer that question, but I
>> >>> suspect 'probably not': further investigation over the weekend suggests that
>> >>> the circumstances may be SIMAP-specific.
>> >>>
>> >>> It appears that the Android Whetstone benchmark used in the BOINC client has
>> >>> separate code paths for ARM, vfp, and NEON processors: a vfp or NEON processor
>> >>> will report that it is significantly faster than a plain-vanilla ARM.
>> >>>
>> >>> However, SIMAP have only deployed a single Android app, which I'm assuming only
>> >>> uses ARM functions: devices with vfp or NEON SIMD vectorisation available would
>> >>> run the non-optimised application much more slowly than BOINC expects.
>> >>>
>> >>> At my suggestion, Thomas Rattei (SIMAP administrator) increased the
>> >>> rsc_fpops_bound multiplier to 10x on Sunday afternoon. I note that the maximum
>> >>> runtime displayed on http://boincsimap.org/boincsimap/server_status.php has
>> >>> already increased from 11 hours to 14 hours since he did that.
>> >>>
>> >>> Thomas has told me "We've seen that [EXIT_TIME_LIMIT_EXCEEDED] a lot. However,
>> >>> due to Samsung PowerSleep, we thought these are mainly "lazy" users just not
>> >>> using their phone regularly for computing." He's going to monitor how this
>> >>> progresses during the remainder of the current batch, and I've asked him to
>> >>> keep us updated on his observations.
>> >>>
>> >>> ________________________________
>> >>> > From: David Anderson <[email protected]>
>> >>> > To: [email protected]
>> >>> > Sent: Monday, 9 June 2014, 3:48
>> >>> > Subject: Re: [boinc_dev] EXIT_TIME_LIMIT_EXCEEDED (sorry, yes me again, but please read)
>> >>> >
>> >>> > Does this problem occur on SETI@home?
>> >>> > -- David
>> >>> >
>> >>> > On 07-Jun-2014 2:51 AM, Richard Haselgrove wrote:
>> >>> >
>> >>> >> 2) Android runtime estimates
>> >>> >>
>> >>> >> The example here is from SIMAP. During a recent pause between batches, I
>> >>> >> noticed that some of my 'pending validation' tasks were being slow to clear:
>> >>> >> http://boincsimap.org/boincsimap/results.php?hostid=349248
>> >>> >>
>> >>> >> The clearest example is the third of those three workunits:
>> >>> >> http://boincsimap.org/boincsimap/workunit.php?wuid=57169928
>> >>> >>
>> >>> >> Four of the seven replications have failed with 'Error while computing', and
>> >>> >> every one of those four is an EXIT_TIME_LIMIT_EXCEEDED on an Android device.
>> >>> >>
>> >>> >> Three of the four hosts have never returned a valid result (total credit
>> >>> >> zero), so they have never had a chance to establish an APR for use in runtime
>> >>> >> estimation: runtime estimates and bounds must have been generated by the server.
>> >>> >>
>> >>> >> It seems - from these results, and others I've found pending on other
>> >>> >> machines - that SIMAP tasks on Android are aborted with
>> >>> >> EXIT_TIME_LIMIT_EXCEEDED after ~6 hours elapsed. For the new batch released
>> >>> >> today, SIMAP are using a 3x bound (which may be a bit low under the
>> >>> >> circumstances):
>> >>> >>
>> >>> >> <rsc_fpops_est>13500000000000.000000</rsc_fpops_est>
>> >>> >> <rsc_fpops_bound>40500000000000.000000</rsc_fpops_bound>
>> >>> >>
>> >>> >> so I deduce that the tasks when first issued had a runtime estimate of ~2 hours.
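(For reference, the arithmetic behind that deduction, assuming the usual relationship that the runtime estimate is roughly rsc_fpops_est divided by the projected FLOPS assigned to the app version:)

    implied projected flops  ~ rsc_fpops_bound / abort time
                             ~ 4.05e13 / (6 h * 3600 s) ~ 1.9 GFLOPS
    initial runtime estimate ~ rsc_fpops_est / projected flops
                             ~ 1.35e13 / 1.9e9 ~ 7,100 s ~ 2 hours

(That is consistent with the 3x bound, 3 x 2 h giving the ~6 h at which the Android tasks were killed, and with the i5 'Haswell' APR of 7.34 GFLOPS mentioned just below: 1.35e13 / 7.34e9 ~ 1,840 s, a bit over half an hour.)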
>> >>> >>
>> >>> >> My own tasks, on a fast Intel i5 'Haswell' CPU (APR 7.34 GFLOPS), take over
>> >>> >> half an hour to complete: two hours for an ARM device sounds suspiciously
>> >>> >> low. The only one of my Android wingmates to have registered an APR
>> >>> >> (http://boincsimap.org/boincsimap/host_app_versions.php?hostid=771033) is
>> >>> >> showing 1.69 GFLOPS, but I have no way of knowing whether that APR was
>> >>> >> established before or after the task in question errored out.
>> >>> >>
>> >>> >> From experience - borne out by current tests at Albert@Home, where server
>> >>> >> logs are helpfully exposed to the public - initial server estimates can be
>> >>> >> hopelessly over-optimistic. These two are for the same machine:
>> >>> >>
>> >>> >> 2014-06-04 20:28:09.8459 [PID=26529] [version] [AV#716] (BRP4G-cuda32-nv301)
>> >>> >> adjusting projected flops based on PFC avg: 2124.60G
>> >>> >> 2014-06-07 09:30:56.1506 [PID=10808] [version] [AV#716] (BRP4G-cuda32-nv301)
>> >>> >> setting projected flops based on host elapsed time avg: 23.71G
>> >>> >>
>> >>> >> Since SIMAP have recently announced that they are leaving the BOINC platform
>> >>> >> at the end of the year (despite being an Android launch partner with
>> >>> >> Samsung), I doubt they'll want to put much effort into researching this issue.
>> >>> >>
>> >>> >> But if other projects experimenting with Android applications are
>> >>> >> experiencing a high task failure rate, they might like to check whether
>> >>> >> EXIT_TIME_LIMIT_EXCEEDED is a significant factor in those failures, and if
>> >>> >> so, consider the other remediation approaches (apart from outliers, which
>> >>> >> isn't relevant in this case) that I suggested to Eric Mcintosh at LHC.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
