The best benchmark for a system is the actual running application; that has long been established fact. The only reason synthetic benchmarks are used at all is that they are much easier to port to other systems.
On the other hand, let us not make the perfect the enemy of the good enough. Yes, SaH is very specific code and does not match any other project perfectly. But the real question we are trying to address is scalability and rough sizing. Back in the day when I did run SaH, if I got a new system and it did SaH twice as fast, I knew that, within a margin, that system was going to do Einstein tasks about twice as fast ... and if we have more than one test tool, the aggregate of them is better than an unstable and unreliable synthetic benchmark.

Yes, architecture counts for a lot, and there are detectable differences between AMD and Intel and, if you spend enough time on it, even within chip families and from one fab run to another. That is going to affect any measurement system ... but the intent of Dr. Anderson's proposal (and mine of yore) is to use averaging over large numbers of systems to come up with numbers that are good enough and better than the slop we have now.

Am I advocating that this, or any other system, will be perfect? No ... I never thought any of them would be ... but I want to get the project guys out of the credit business while preserving some of the qualities of fairness and parity that were part of the original design intent. What we should be trying to do is get rid of the instabilities and the sources of perceived unfairness and move to an automated system, so the project types are out of the business of fiddling with credit issues.

Eric's script is a different attempt to get us from here to there ... at the cost of constant deflation and the complete severance of the Cobblestone from its published definition. Oh, and it is purely based on that SaH task standard you argued against in the other part of your post ... :)
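To make the averaging idea concrete, here is a rough sketch of what I have in mind. Purely illustrative: every name and number below is invented for the example and none of it is actual BOINC code. Calibrate each host against a hand-built reference task with a known FLOP count, grant credit from elapsed time and the calibrated speed, and average over the hosts that return the same workunit.

    from statistics import median

    # Illustrative sketch only; names and numbers are invented.
    REFERENCE_JOB_FLOPS = 3.0e13   # assumed FLOP count of a hand-built reference task

    def host_calibration(ref_runtimes_sec):
        """Estimate a host's sustained FLOPS from several reference-task runs.
        The median damps the run-to-run spread we see with the current benchmarks."""
        return REFERENCE_JOB_FLOPS / median(ref_runtimes_sec)

    def claimed_credit(task_runtime_sec, host_flops):
        """Elapsed time times calibrated speed, scaled so the published
        Cobblestone definition (200 credits per day at 1 GFLOPS) is kept."""
        return task_runtime_sec * (host_flops / 1e9) * (200.0 / 86400.0)

    def granted_credit(per_host_claims):
        """Averaging over the hosts that returned the workunit is what smooths
        out the AMD/Intel and cache-size differences."""
        return sum(per_host_claims) / len(per_host_claims)

The details do not matter; the point is that the numbers come from real work, averaged over many systems, rather than from a one-shot synthetic benchmark.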
On Sep 21, 2009, at 9:28 AM, Lynn W. Taylor wrote:

> Unfortunately, at the end of the day, replacing the benchmark with a reference work unit is just replacing one arbitrary benchmark with a different arbitrary benchmark.
>
> The problem with the existing benchmark is that the benchmark code doesn't represent the instruction mix for the project.
>
> When you say "We'll use a specific s...@home work unit as a 'reference' work unit" you have the same problem: the instruction mix does not match any other project.
>
> The instruction mix likely doesn't even match from Multibeam to Astropulse.
>
> In fact, one criticism of the benchmark is that it fits in the cache on virtually every modern processor. Multibeam fits in the cache on high-end processors and does not fit on the low end.
>
> .... meaning processor architecture still matters a lot more than we'd like.
>
> Can the credit system be improved? Yes. Is working out the credit multiplier difficult? Yes. Is it possible to devise a credit system with perfect cross-project parity? No.
>
> -- Lynn
>
> Paul D. Buck wrote:
>> Though it looks like the conversation died down again ... I think there are a couple of points yet to be made.
>>
>> If I had one and only one objection to be made, it is that this system seems to be based upon the benchmarking system without any attempt being made to correct for its deficiencies (as best I can tell). To my mind the worst feature of the benchmarks was not that they were inaccurate, but that they cannot be replicated. Repeated runs, even on systems that are quiescent, can report results with a spread of as much as 20%.
>>
>> The concept of a "reference job" I am happy to see, as that was the cornerstone of the proposal I made for the use of calibration to quantify and test our systems in the BOINC universe. See: http://www.boinc-wiki.info/Improved_Benchmarking_System_Using_Calibration_Concepts
>>
>> I still see SaH as one of the "best" sources in that the source is public and probably the best understood. Most importantly, it should be relatively easy to make known test tasks by hand that have known characteristics that can be tested, and to a great extent perhaps even tested with instrumented code so that precise counts of FLOPS could be made.
>>
>> An assumption is made that the GPU versions will be more efficient. I think Aqua found that the converse is true (I do not know this for sure; it was in a post I read the other day discussing projects with GPU applications, saying they dropped the GPU version because it was worse than the multi-threaded CPU version).
>>
>> It may be that I am too dense to get it, but I also do not see how this proposal would adequately address the quality metrics we might extract from those projects where there are applications that span the types and classes of computing resources. For example, the two "best" projects at this time are MilkyWay and Collatz in that they have applications that span all three of the currently available computing resources: CPU, Nvidia CUDA, and ATI Stream.
>>
>> And finally, the issue of optimized applications vs. the "stock" application ... the hardware will report the same FLOPS, but it seems to me the faster execution time of the optimized application will cause problems.
>>
>> Oops, two more finallies: you would require a change to all science applications to make this effective, and you still require the projects to make an initial estimate regardless of its accuracy (predicted number of app units).
>>
>> On Aug 28, 2009, at 12:45 PM, David Anderson wrote:
>>> I'm coming around to the viewpoint that projects shouldn't be expected to supply estimates of job duration or application performance. I think it's feasible to maintain these estimates dynamically, based on actual job runtimes. I've sketched a set of changes that would accomplish this: http://boinc.berkeley.edu/trac/wiki/AutoFlops
>>> Comments welcome.
>>>
>>> BTW, a bonus of the proposed design is that it provides a project-independent credit-granting policy.
>>>
>>> -- David
>>>
>>> Richard Haselgrove wrote:
>>>> ... if projects are expected to fine-tune performance metrics down to the individual plan_class level, then I'm sorry, but they just won't. I've had to shout (loudly and repeatedly) at both AQUA and GPUGrid to get them to adjust rsc_fpops_est to within an order of magnitude of reality (in AQUA's case, two orders of magnitude).
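For what it is worth, here is my reading of the "maintain these estimates dynamically" part of David's note, again as a purely illustrative sketch; the class and names below are mine, not anything from the AutoFlops page. The server keeps refining its runtime estimate for each app version from completed jobs, so even a poor initial rsc_fpops_est gets corrected over time.

    # Illustrative sketch only; all names invented, not actual AutoFlops code.
    class RuntimeEstimator:
        """Per-app-version runtime estimate, refined from actual completions."""

        def __init__(self, initial_guess_sec, weight=0.1):
            self.estimate_sec = initial_guess_sec   # project's rough first estimate
            self.weight = weight                    # how quickly new results dominate

        def record_completion(self, actual_runtime_sec):
            # Exponentially weighted moving average over observed runtimes.
            self.estimate_sec += self.weight * (actual_runtime_sec - self.estimate_sec)

        def estimated_flops(self, host_flops):
            # Converted back to a FLOP count for scheduling and deadline purposes.
            return self.estimate_sec * host_flops

Even an initial guess that is off by two orders of magnitude, as Richard describes for AQUA, would converge toward the observed runtimes after a few dozen completions, and nobody at the project would have to fiddle with the estimate again.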
