Hi Johan,

On 06 Feb 2013, at 17:04, Johan Tibell <johan.tib...@gmail.com> wrote:

> On Wed, Feb 6, 2013 at 2:09 AM, Simon Marlow <marlo...@gmail.com> wrote:
> This is slightly off topic, but I wanted to plant this thought in people's 
> brains: we shouldn't place much significance in the average of a bunch of 
> benchmarks (even the geometric mean), because it assumes that the benchmarks 
> have a sensible distribution, and we have no reason to expect that to be the 
> case.  For example, in the results above, we wouldn't expect a 14.7% 
> reduction in runtime to be seen in a typical program.
> 
> Using the median might be slightly more useful, which here would be something 
> around 0% for runtime, though still technically dodgy.  When I get around to 
> it I'll modify nofib-analyse to report medians instead of GMs.

No.

> Using the geometric mean as a way to summarize the results isn't that bad. 
> See "How not to lie with statistics: the correct way to summarize benchmark 
> results" (http://ece.uprm.edu/~nayda/Courses/Icom6115F06/Papers/paper4.pdf).

I would argue the exact opposite. The geometric mean has absolutely no meaning 
whatsoever. 

See e.g., 

- L. Eeckhout. Computer Architecture Performance Evaluation Methods. Synthesis
  Lectures on Computer Architecture, Morgan & Claypool Publishers, June 2010.
- T. Kalibera and R. Jones. Quantifying Performance Changes with Effect Size
  Confidence Intervals. Technical report, 2012.
- D. J. Lilja. Measuring Computer Performance: A Practitioner's Guide. 2005.
- R. Jain. The Art of Computer Systems Performance Analysis: Techniques for
  Experimental Design, Measurement, Simulation, and Modeling. 1991.

> That being said, I think the most useful thing to do is to look at the big 
> losers, as they're often regressions. Making some class of programs much 
> worse but improving the geometric mean overall is often worse than
> changing nothing at all.

Yes.

Regards,
-- Andy Georges

PS. I wrote this a while back for the Evaluate Collaboratory


# The Mean Is Not A Simple Average

## Application domain

Aggregating measurements. This anti-pattern discusses a single example, but for
any use of an average it is important to consider which mean is the right one.
Applicable means include the (weighted) arithmetic and the (weighted) harmonic
mean, each with respect to the proper weighting factors.
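
For reference, the standard definitions of the three means that appear below,
with weights $w_i \ge 0$ summing to 1 (the unweighted versions take $w_i = 1/n$):

$$
\mathrm{AM} = \sum_i w_i x_i, \qquad
\mathrm{HM} = \left( \sum_i \frac{w_i}{x_i} \right)^{-1}, \qquad
\mathrm{GM} = \prod_i x_i^{w_i}.
$$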

## Premise

You have a set of benchmarks and you wish to quantify your Shiny New Idea 
(SNI). To make it easy to grok the results, you decide to aggregate the impact 
of your Shiny New Idea into a single performance number: the mean of the various
measurements. Readers of your paper can then easily compare single numbers: the
one for a baseline system, those for other enhancements you compare with, and,
of course, the number for your SNI.

## Description

You have implemented your SNI and you wish to conduct a comparison study to
show that your SNI outperforms existing work and improves execution time (or
another metric, such as energy consumption) by X% compared to a baseline
system. You design an experiment with a set of benchmarks from an applicable
benchmark suite and you assemble performance numbers for each benchmark and for
each scenario (baseline, your SNI, other work, ...). For example, you assemble
execution times (the metric of choice for single-program workloads) and you
wish to assess the speedup.

Since people prefer single numbers they can compare to see which one is bigger,
you must aggregate your data into an average value. While this contains less
information than the original data set, it is an easy way to see whether your
SNI improves things or not, and to prove it to your readers or users.

You should choose a mean that allows you to: (i) directly compare the
alternatives to each other by canceling out the baseline, and (ii) make sure
(relevant) outliers do not influence your average too much. Clearly, the
geometric mean is perfectly suited for this purpose. Without further ado, you
determine the per-benchmark speedup for each scenario and compute the various
geometric means.
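
As a concrete sketch of this recipe (in Haskell, since we are on ghc-devs; the
helper names `speedups` and `geomean` are mine, not from any library):

```haskell
-- Per-benchmark speedups of a scenario against a baseline: b_i / a_i.
speedups :: [Double] -> [Double] -> [Double]
speedups baseline scenario = zipWith (/) baseline scenario

-- Geometric mean, computed via logs to avoid overflow on long suites.
geomean :: [Double] -> Double
geomean xs = exp (sum (map log xs) / fromIntegral (length xs))
```

For the data in the Example section below,
`geomean (speedups [10,15,7,12,16] [8,10,6,11,12])` gives roughly 1.2604.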

The resulting average values immediately allow you to see whether your SNI
improves on the other scenarios derived from existing work. They also allow you
to see by how much, by dividing the geometric mean of your SNI by that of each
other scenario. Do not worry: the formula for the geometric mean makes sure
that the baseline values cancel out, so you effectively get the average speedup
of your SNI compared to existing work.
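
The cancellation is a one-line identity: with baseline times $b_i$ and two
scenarios with times $a_i$ and $c_i$,

$$
\frac{\mathrm{GM}(b_i/a_i)}{\mathrm{GM}(b_i/c_i)}
= \left(\prod_{i=1}^n \frac{b_i/a_i}{b_i/c_i}\right)^{1/n}
= \left(\prod_{i=1}^n \frac{c_i}{a_i}\right)^{1/n}
= \mathrm{GM}(c_i/a_i),
$$

so the $b_i$ indeed drop out. Neither the arithmetic nor the harmonic mean has
this property.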

Now go ahead and publish these numbers that support your SNI.

## Why this is a bad idea

There may be specific circumstances where the use of a geometric mean is
warranted, but producing the average over some benchmark suite for a
performance metric of your choice is not one of them. Typically, the geometric
mean can be used when the final aggregate performance number results from
multiplying individual numbers. For example, when making several successive
enhancements to a system, the average improvement per enhancement can be
expressed as the geometric mean of the speedups resulting from the individual
improvements. However, for any benchmark suite (regardless of the relative
importance one attaches to each benchmark in the suite), the aggregate results
from adding the individual results, as is the case for, e.g., overall speedup.
In practically all cases, using either the (weighted) arithmetic mean or the
(weighted) harmonic mean is the correct way to compute and report an average.
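
To make the reasoning explicit: for execution times, the aggregate one actually
cares about is the total-time speedup, and a properly weighted mean reproduces
it exactly. With baseline times $b_i$, enhanced times $n_i$, per-benchmark
speedups $s_i = b_i/n_i$, and baseline-derived weights $w_i = b_i / \sum_j b_j$:

$$
\mathrm{WHM} = \left(\sum_i \frac{w_i}{s_i}\right)^{-1}
= \left(\sum_i \frac{b_i}{\sum_j b_j} \cdot \frac{n_i}{b_i}\right)^{-1}
= \frac{\sum_i b_i}{\sum_i n_i},
$$

which is exactly the overall speedup. The analogous computation with
enhanced-time weights yields the same result for the weighted arithmetic mean.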

While it is true that the geometric mean is less affected by outliers in the
measurements than the other means, one should always investigate outliers and
disregard them only if there is an indication that the data is wrong.
Otherwise, they can provide valuable insight. Moreover, by adding appropriate
weights, one can easily reduce the impact of outliers.

## Example

Suppose you have 5 benchmarks, B1 ... B5. The baseline system has the following 
measurements: 10, 15, 7, 12, and 16, which yields a total execution time of 60. 
Hence, the aggregate score is the sum of the individual scores. Suppose now
you wish to compare two different enhancements. The first enhancement yields
the measurements 8, 10, 6, 11, 12 -- adding up to 47; the second enhancement 
yields the measurements 7, 12, 5, 10, 14 -- adding up to 48.

If we take a look at the global improvement achieved, then that is 60/47 =
1.2766 and 60/48 = 1.25 for enhancement 1 and enhancement 2, respectively.
Therefore we conclude that by a small margin, enhancement 1 outperforms 
enhancement 2, for this particular set of benchmarks.

However, the geometric means are 1.2604 and 1.2794. From these numbers we would 
conclude the opposite, namely that enhancement 2 outperforms enhancement 1.
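
These numbers are easy to reproduce; a small sketch reusing the `geomean` and
`speedups` helpers defined earlier:

```haskell
baseline, enh1, enh2 :: [Double]
baseline = [10, 15, 7, 12, 16]  -- total: 60
enh1     = [ 8, 10, 6, 11, 12]  -- total: 47
enh2     = [ 7, 12, 5, 10, 14]  -- total: 48

-- Total-time speedups: 60/47 ~ 1.2766, 60/48 = 1.25  (enhancement 1 wins).
-- Geometric means:     ~1.2604 vs ~1.2794             (enhancement 2 "wins").
example :: ((Double, Double), (Double, Double))
example = ( (sum baseline / sum enh1, sum baseline / sum enh2)
          , (geomean (speedups baseline enh1), geomean (speedups baseline enh2)) )
```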

Which mean then yields the correct result? The answer depends on which system
you weight against. If we weight against the enhanced system, giving the
benchmarks weights that correspond to their relative execution time within the
complete suite (on that same configuration), then we need to use the weighted
arithmetic mean. If we weight against the baseline system, the correct answer
is that we need to use the weighted harmonic mean. A sketch of both weightings
follows.
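
A minimal sketch of both weightings (again with illustrative helper names);
each recovers the total-time speedup, e.g. 60/47 for enhancement 1:

```haskell
-- Weighted arithmetic mean of speedups, weights taken from each
-- benchmark's share of the *enhanced* system's total time.
wam :: [Double] -> [Double] -> Double
wam base enh = sum (zipWith (*) ws (speedups base enh))
  where ws = map (/ sum enh) enh

-- Weighted harmonic mean of speedups, weights taken from each
-- benchmark's share of the *baseline* system's total time.
whm :: [Double] -> [Double] -> Double
whm base enh = recip (sum (zipWith (/) ws (speedups base enh)))
  where ws = map (/ sum base) base

-- wam baseline enh1 == whm baseline enh1 == 60/47 ~ 1.2766
```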

Of course, the use of weights is often disregarded. If we assume all benchmarks
are of equal importance, then we likely will not weight them. In that case, all
three means yield the same conclusion, but none of them accurately reflects the
true speedup that is achieved over the entire suite.
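
Checking that claim on the data above: the unweighted arithmetic, geometric,
and harmonic means all rank enhancement 2 first, yet none equals the true
total-time speedups of 1.2766 and 1.25.

```haskell
-- Unweighted arithmetic and harmonic means, to complement geomean above.
amean, hmean :: [Double] -> Double
amean xs = sum xs / fromIntegral (length xs)
hmean xs = fromIntegral (length xs) / sum (map recip xs)

-- amean:   ~1.2682 (enh1) vs ~1.2843 (enh2)
-- geomean: ~1.2604 (enh1) vs ~1.2794 (enh2)
-- hmean:   ~1.2530 (enh1) vs ~1.2747 (enh2)
```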

## Why is this pattern relevant

The geometric mean is still widely used and accepted by researchers. It can be
found in papers published at top venues, such as OOPSLA, PLDI, CGO, etc. It is
commonly used by benchmark suites such as VMmark and SPEC CPU. On multiple
occasions the argument regarding the impact of outliers is brought forth, even
though there are other ways to deal with outliers.

## References

- [1] J. E. Smith. Characterizing computer performance with a single number.
  CACM 31(10), 1988.
- [2] D. A. Patterson and J. L. Hennessy. Computer Organization and Design:
  The Hardware/Software Approach. Morgan Kaufmann.


