The best benchmark for a system is the actual running application. That has long been established fact; the only reason synthetic benchmarks are used is that they are far easier to port to other systems.

On the other hand, let us not make the perfect the enemy of the good enough.

Yes, SaH is very specific code and does not perfectly match any other project. But the real question we are trying to address is scalability and rough sizing. Back in the day when I did run SaH, if a new system did SaH twice as fast, I knew that, plus or minus, it was going to do Einstein tasks about twice as fast. And if we have more than one test tool, the aggregate of them is better than applying an unstable and unreliable synthetic benchmark.
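For what it's worth, the "aggregate of several test tools" idea can be sketched in a few lines. Everything below is invented for illustration: the task names, the baseline timings, and the choice of a geometric mean (which is just one reasonable way to combine per-task ratios).

```python
# Hypothetical sketch: combine several reference-task timings into one
# score instead of trusting a single synthetic benchmark.
# Task names and baseline runtimes (seconds) are made up.
BASELINE = {"sah_ref": 3600.0, "einstein_ref": 5400.0, "cpdn_ref": 7200.0}

def aggregate_speedup(measured):
    """Geometric mean of per-task speedups relative to the baseline machine."""
    ratios = [BASELINE[task] / secs for task, secs in measured.items()]
    product = 1.0
    for r in ratios:
        product *= r
    return product ** (1.0 / len(ratios))

# A machine that runs every reference task twice as fast scores ~2.0.
score = aggregate_speedup(
    {"sah_ref": 1800.0, "einstein_ref": 2700.0, "cpdn_ref": 3600.0}
)
```

The geometric mean keeps one outlier task from dominating the score, which is exactly the property you want when no single test tool is trusted on its own.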

Yes, architecture counts for a lot: there are detectable differences between AMD and Intel and, if you spend enough time on it, even within chip families and from one fab run to another. That is going to affect any measurement system. But the intent of Dr. Anderson's proposal (and mine of yore) is to use averaging over large numbers of systems to come up with numbers that are good enough, and better than the slop we have now.
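A toy illustration of why that averaging helps. The per-host spread (up to +/-20%), the "true" rate, and the host count are all made-up numbers; the point is only that individual measurements wander while the population mean stays put.

```python
# Hypothetical sketch: each host's single measurement of a reference job
# is noisy (architecture, fab run, background load), but averaging over
# many hosts gives a stable project-wide number. All figures invented.
import random

random.seed(42)
TRUE_RATE = 10.0  # notional GFLOPS-equivalent rate for this host class

# Each host's one measurement is off by up to +/-20 percent.
measurements = [TRUE_RATE * random.uniform(0.8, 1.2) for _ in range(10_000)]

estimate = sum(measurements) / len(measurements)
# The mean lands well inside the 20% per-host spread.
```

The standard error of the mean shrinks with the square root of the number of hosts, so even sloppy per-host numbers average into a usable project-wide figure.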

Am I advocating that this, or any other, system will be perfect? No; I never thought that any of them would be. But I want to get the project guys out of the credit business while preserving some of the qualities of fairness and parity that were part of the original design intent.

What we should be trying to do is get rid of the instabilities and the sources of perceived unfairness, and move to an automated system that gets the project types out of the business of fiddling with credit issues.
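One hypothetical shape such an automated system could take: a per-project multiplier that is nudged toward a fixed target rate, with no human in the loop. The target, the smoothing constant, and the rates below are all invented for illustration; this is a sketch of the feedback idea, not any actual BOINC mechanism.

```python
# Hypothetical sketch: automatically adjust a per-project credit
# multiplier so the average granted rate drifts toward a fixed target
# (e.g. a published Cobblestone-style definition). Figures invented.
TARGET_RATE = 100.0   # target credits/day per unit of measured capacity
ALPHA = 0.1           # smoothing constant: small steps avoid oscillation

def update_multiplier(multiplier, observed_rate):
    """Move the multiplier a fraction of the way toward the target ratio."""
    return multiplier * (1.0 - ALPHA + ALPHA * TARGET_RATE / observed_rate)

# A project paying 125 credits/day drifts down toward parity over time.
m = 1.0
raw_rate = 125.0
for _ in range(100):
    m = update_multiplier(m, raw_rate * m)
# After convergence, raw_rate * m sits at the target: no manual fiddling.
```

The smoothing constant trades responsiveness for stability: a large ALPHA corrects over-paying projects quickly but amplifies measurement noise, which is the same instability we are trying to get rid of.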

Eric's script is a different attempt to get us from here to there, at the cost of constant deflation and the complete severance of the Cobblestone from its published definition. Oh, and it is purely based on that SaH task standard that you argued against in the other part of your post ... :)

On Sep 21, 2009, at 9:28 AM, Lynn W. Taylor wrote:

> Unfortunately, at the end of the day, replacing the benchmark with a  
> reference work unit is just replacing one arbitrary benchmark with a  
> different arbitrary benchmark.
>
> The problem with the existing benchmark is that the benchmark code  
> doesn't represent the instruction mix for the project.
>
> When you say "We'll use a specific s...@home work unit as a  
> 'reference' work unit" you have the same problem: the instruction  
> mix does not match any other project.
>
> The instruction mix likely doesn't even match from Multibeam to  
> Astropulse.
>
> In fact, one criticism of the benchmark is that it fits in the cache  
> on virtually every modern processor.  Multibeam fits in the cache on  
> high-end processors and does not fit on the low end.
>
> .... meaning processor architecture still matters a lot more than  
> we'd like.
>
> Can the credit system be improved?  Yes.  Is working out the credit  
> multiplier difficult?  Yes.  Is it possible to devise a credit  
> system with perfect cross-project parity?  No.
>
> -- Lynn
>
> Paul D. Buck wrote:
>> Though it looks like the conversation died down again ... I think   
>> there are a couple points yet to be made.
>> If I had one and only one objection to be made it is that this  
>> system  seems to be based upon the benchmarking system without any  
>> attempts  being made to correct for those deficiencies (as best I  
>> can tell). To my mind the worst feature of the benchmarks was not  
>> that they were inaccurate, but that they cannot be replicated.  
>> Repeated runs, even on quiescent systems, can report results that  
>> spread by as much as 20%.
>> The concept of a "reference job" I am happy to see as that was the   
>> cornerstone of the proposal I made for use of calibration to  
>> quantify  and test our systems in the BOINC universe. See: 
>> http://www.boinc-wiki.info/Improved_Benchmarking_System_Using_Calibration_Concepts
>> I still see SaH as one of the "best" sources in that the source is  
>> public and probably the best understood.  Most importantly, it  
>> should be relatively easy to hand-build test tasks with known  
>> characteristics, and to a great extent perhaps even to run them  
>> through instrumented code so that precise FLOP counts could be made.
>> An assumption is made that the GPU versions will be more efficient.  
>> I think Aqua found that the converse is true (I do not know this for  
>> sure; it was in a post I read the other day discussing projects with  
>> GPU applications: they dropped the GPU version because it was worse  
>> than the multi-threaded CPU version).
>> It may be that I am too dense to get it, but I also do not see how   
>> this proposal would adequately address the quality metrics we  
>> might  extract from those projects where there are applications  
>> that span the  types and classes of computing resources.  For  
>> example, the two "best"  projects at this time are MilkyWay and  
>> Collatz in that they have  applications that span all three of the  
>> currently available computing  resources: CPU, Nvidia CUDA, and ATI  
>> Stream.
>> And finally, the issue of optimized applications vs. "stock"   
>> application ... the hardware will report the same FLOPS but it  
>> seems  to me the faster execution time of the optimized application  
>> will  cause problems.
>> Oops, two more finallies: you would require a change to all science  
>> applications to make this effective, and you still require the  
>> projects to make an initial estimate regardless of its accuracy  
>> (predicted number of app units).
>> On Aug 28, 2009, at 12:45 PM, David Anderson wrote:
>>> I'm coming around to the viewpoint that projects shouldn't be  
>>> expected
>>> to supply estimates of job duration or application performance.
>>> I think it's feasible to maintain these estimates dynamically,
>>> based on actual job runtimes.
>>> I've sketched a set of changes that would accomplish this:
>>> http://boinc.berkeley.edu/trac/wiki/AutoFlops
>>> Comments welcome.
>>>
>>> BTW, a bonus of the proposed design is that it provides
>>> a project-independent credit-granting policy.
>>>
>>> -- David
>>>
>>> Richard Haselgrove wrote:
>>>> ...  if projects
>>>> are expected to fine-tune performance metrics down to the  
>>>> individual
>>>> plan_class level, then I'm sorry, but they just won't. I've had  
>>>> to  shout
>>>> (loudly and repeatedly) at both AQUA and GPUGrid to get them to   
>>>> adjust
>>>> rsc_fpops_est to within an order of magnitude of reality (in  
>>>> AQUA's  case,
>>>> two orders of magnitude).
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected]
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.
