Very radical centrist! But if it is this tricky in hard science, how will we 
ever get this right in the social sciences?

The truth, the whole truth, and nothing but the truth: a pragmatic guide to 
assessing empirical evaluations
Blackburn et al., ACM Transactions on Programming Languages and Systems, 2016
https://blog.acolyer.org/2017/01/26/the-truth-the-whole-truth-and-nothing-but-the-truth-a-pragmatic-guide-to-assessing-empirical-evaluations/

Yesterday we looked at some of the ways analysts may be fooled into thinking 
they’ve found a statistically significant result when in fact they haven’t. 
Today’s paper choice looks at what can go wrong with empirical evaluations.

Let’s start with a couple of definitions:

- An evaluation is either an experiment or an observational study, consisting 
of the steps performed and the data produced from those steps.
- A claim is an assertion about the significance and meaning of an evaluation; 
thus, unlike a hypothesis, which precedes an evaluation, a claim comes after 
the evaluation.
- A sound claim is one where the evaluation provides all the evidence necessary 
to support the claim, and does not provide any evidence that contradicts it.
Assuming honest researchers, things can go wrong in one of two basic ways: sins 
of reasoning are those where the claim is not supported by the evaluation; and 
sins of exposition are those where the description of either the evaluation or 
the claim is not sufficient for readers to evaluate and/or reproduce the 
results.



As the authors show, it’s not easy to avoid these traps even when you are doing 
your best. We’ll get into the details shortly, but for those authoring (or 
reviewing) evaluations and claims, here are five questions to help you avoid 
them:

- Has all of the data from the evaluation been considered, and not just the 
data that supports the claim?
- Have any assumptions been made in forming the claim that are not justified by 
the evaluation?
- If the experimental evaluation is compared to prior work, is it an 
apples-to-oranges comparison?
- Has everything essential for getting the results been described?
- Has the claim been unambiguously and clearly stated?

The two sins of exposition

The sin of inscrutability occurs when poor exposition obscures a claim. That 
is, we can’t be sure what is really being claimed! The most basic form of 
inscrutability is simple omission – leaving the reader to figure out for 
themselves what the claims might be! Claims may also be ambiguous – for 
example, “improves performance by 10%” (throughput or latency?), or “reduces 
latency by 10%” (at what percentile, and under what load?). Distorted claims 
are clear from the reader’s perspective, but don’t actually capture what the 
author intended.

The sin of irreproducibility occurs when poor exposition obscures an 
evaluation: the evaluation steps, the data, or both. The evaluation may not be 
reproducible because key information is omitted, or because imprecise language 
and lack of detail lead to ambiguity. One important kind of omission that is 
hard to guard against is an incomplete understanding of the factors that are 
relevant to the evaluation.

Experience shows that whole communities can ignore important, though 
non-obvious, aspects of the empirical environment, only to find out their 
significance much later, throwing into question years of published results. 
This situation invites two responses: (i) authors should be held to the 
community’s standard of what is known about the importance of the empirical 
environment, and (ii) the community should intentionally and actively improve 
this knowledge, for example, by promoting reproduction studies.

A good example of a significant factor that was once unknown is the size of 
the environment variables when measuring speed-ups of gcc on Linux! The size 
of the environment affects memory layout, and thus the performance of the 
program. An evaluation exploring only one (unspecified) environment size leads 
to the sin of irreproducibility. Researchers knew, of course, that memory 
layout affects performance, but it wasn’t obvious that environment variables 
would affect it enough to make a difference… until someone measured it.
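The effect is easy to probe for yourself. Here is a minimal sketch (not from the paper) that times a trivial child process while padding the environment with a hypothetical `PAD` variable of varying size; any systematic runtime difference across padding sizes would be pure environment-layout effect, since the child does nothing:

```python
import os
import statistics
import subprocess
import sys
import time

def run_once(env):
    """Time one run of a trivial child process under the given environment."""
    t0 = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], env=env, check=True)
    return time.perf_counter() - t0

def median_runtime(pad_bytes, repeats=10):
    """Median runtime with pad_bytes of extra data in the environment."""
    env = dict(os.environ)
    if pad_bytes:
        env["PAD"] = "x" * pad_bytes  # grows the environment block the OS copies in
    return statistics.median(run_once(env) for _ in range(repeats))

for pad in (0, 1024, 8192):
    print(f"env padding {pad:5d} bytes: {median_runtime(pad) * 1000:.2f} ms")
```

A sound evaluation would report (and ideally sweep) the environment size rather than silently inheriting whatever the experimenter's shell happened to export.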



The three sins of reasoning

The three sins of reasoning are the sin of ignorance, the sin of 
inappropriateness, and the sin of inconsistency. The sin of ignorance occurs 
when a claim is made that ignores elements of the evaluation supporting a 
contradictory alternative claim (i.e., selecting data points to substantiate a 
claim while ignoring other relevant points).

In our experience, while the sin of ignorance seems obvious and easy to avoid, 
in reality, it is far from it. Many factors in the evaluation that seem 
irrelevant to a claim may actually be critical to the soundness of the claim.

Consider the following latency histogram for a component of Gmail that has both 
a fast-path and a slow-path through it.



The distribution is bimodal, which means that commonly used statistical 
techniques that assume the data is normally distributed do not apply. If a 
claim is made on the basis of such techniques, it is unsound and we have 
committed the sin of ignorance. This is easy to see when you have the 
histogram, but suppose the experiment had only computed the mean and standard 
deviation on the fly – then it might not be obvious at all!
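A small simulation makes the trap concrete. The numbers below are hypothetical (an 80/20 mix of a ~1 ms fast path and a ~50 ms slow path, not Gmail's actual data): the on-the-fly mean lands in a region where almost no request actually falls, while even a crude text histogram exposes the two modes.

```python
import random
import statistics

random.seed(0)

# Hypothetical latencies (ms): 80% fast path near 1 ms, 20% slow path near 50 ms.
latencies = [
    random.gauss(1.0, 0.2) if random.random() < 0.8 else random.gauss(50.0, 5.0)
    for _ in range(10_000)
]

# Summary statistics computed "on the fly" hide the two modes entirely:
# the mean sits between them, describing almost no real request.
print(f"mean = {statistics.mean(latencies):.1f} ms, "
      f"stdev = {statistics.stdev(latencies):.1f} ms")

# A coarse text histogram makes the bimodality obvious.
buckets = [0] * 7
for x in latencies:
    buckets[min(max(int(x // 10), 0), 6)] += 1
for i, count in enumerate(buckets):
    print(f"{i * 10:3d}-{i * 10 + 9:3d} ms | {'#' * (count // 250)} ({count})")
```

Retaining even a low-resolution histogram alongside the running mean and standard deviation is cheap insurance against this particular sin.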

The sin of inappropriateness occurs when a claim is made that is predicated on 
some fact that is absent from the evaluation.

In our experience, while the sin of inappropriateness seems obvious and easy to 
avoid, in reality, it is far from it. Many factors that may be unaccounted for 
in the evaluation may actually be important for deriving a sound claim.

Consider an evaluation trying to measure energy consumption. Often execution 
time is measured as a proxy, but it turns out that other factors also affect 
energy consumption, and thus execution time alone does not support a claim 
about energy consumption. That execution time is not always an accurate proxy 
for energy consumption was not obvious – until it was pointed out in a 2001 
paper. As another example, before it was widely understood that heap size has 
a big impact on garbage collector performance, many papers derived claims from 
evaluations at only one heap size, and did not report which.

The sin of inconsistency is a sin of faulty comparison – two systems are 
compared, but each is evaluated in a way that is inconsistent with the other.

In our experience, while the sin of inconsistency seems obvious and easy to 
avoid, in reality, it is far from it. Many artifacts that seem comparable may 
actually be inconsistent.

Consider the use of hardware performance counters to evaluate performance…

…the performance counters that are provided by different vendors may vary, and 
even the performance counters provided by the same vendor may vary across 
different generations of the same architecture. Comparing data from what may 
seem like similar performance counters within an architecture, across 
architectures, or between generations of the same architecture may result in 
the sin of inconsistency, because the hardware performance counters are 
counting different hardware events.

Comparing against inconsistent workloads would be another way of falling foul 
of this sin: for example, testing one algorithm during the first half of a 
week, and another algorithm during the second half of a week, when the 
underlying distribution of workload is not uniform across the week.
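This confound is easy to reproduce in a toy simulation (mine, not the paper's). Below, two "algorithms" are deliberately identical, but the underlying workload drifts upward across the measurement window; measuring them in sequential halves fabricates a difference, while randomly interleaving the measurements does not:

```python
import random
import statistics

random.seed(1)

def observed_cost(t):
    """Cost of serving one request at time t in [0, 1): the underlying
    workload drifts upward across the measurement window ("the week")."""
    return 10.0 + 5.0 * t + random.gauss(0.0, 0.5)

# Both "algorithms" are identical, so any measured gap is pure confound.
n = 1000

# Inconsistent design: algorithm A measured in the first half of the week,
# algorithm B in the second half. B inherits the heavier late-week load.
seq_a = [observed_cost(i / (2 * n)) for i in range(n)]
seq_b = [observed_cost((n + i) / (2 * n)) for i in range(n)]

# Consistent design: randomly interleave A and B across the whole week,
# so both see the same distribution of workload.
rand_a, rand_b = [], []
for i in range(2 * n):
    (rand_a if random.random() < 0.5 else rand_b).append(observed_cost(i / (2 * n)))

print(f"sequential:  A = {statistics.mean(seq_a):.2f}, "
      f"B = {statistics.mean(seq_b):.2f}")
print(f"interleaved: A = {statistics.mean(rand_a):.2f}, "
      f"B = {statistics.mean(rand_b):.2f}")
```

The sequential design reports a sizeable gap between two algorithms that are the same code; randomized interleaving is the standard remedy when the environment cannot be held fixed.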


