Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/11018#issuecomment-178473958
@davies I think this depends on the error distribution.
Let's assume that the measured running time of algorithm A is described by
an additive model: a + W, where a is a constant indicating the ideal running
time and W is a *positive* random variable describing system noise/overhead.
We assume the same error distribution for algorithm B: b + W. Essentially, we
want to test which of a and b is smaller (faster). One common approach is to
compare the sample means of `a + W` and `b + W`; you propose comparing the
sample minima of `a + W` and `b + W`.
If we agree on this model, which method is better depends mainly on the
variance of the sample mean versus that of the sample min (the first order
statistic). We know that the variance of the sample mean is of order `O(1/n)`
(CLT), while the variance of the first order statistic is very sensitive to
the distribution of W. If W follows a uniform distribution, the variance of
the first order statistic is of order `O(1/n^2)`, which is indeed better than
that of the sample mean. However, if the error distribution has little mass
near 0, the variance of the first order statistic can be very large. This is
easy to verify numerically.
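Here is one way such a numerical check could look. This is an illustrative sketch, not code from the PR: the constants (a = 1, n = 30, 2000 sessions) and both noise laws are my own choices. Uniform noise has positive density at 0, so its min concentrates at rate `O(1/n)`; `W = U^(1/3)` has density `3w^2`, which vanishes at 0, so its min converges only at rate `O(n^(-1/3))` and the mean wins.

```python
import random
import statistics

# Monte Carlo sketch (illustrative setup): compare the spread of the sample
# mean vs. the sample min of a + W across repeated benchmarking sessions.

def estimator_spread(noise, n=30, sessions=2000, a=1.0):
    """Std. dev. of the sample mean and the sample min over many sessions."""
    means, mins = [], []
    for _ in range(sessions):
        times = [a + noise() for _ in range(n)]
        means.append(statistics.mean(times))
        mins.append(min(times))
    return statistics.stdev(means), statistics.stdev(mins)

random.seed(0)

# Uniform noise on [0, 1]: density positive at 0, min concentrates quickly.
u_mean, u_min = estimator_spread(random.random)

# Noise with little mass near 0: U^(1/3) sampled by inverse-CDF transform.
h_mean, h_min = estimator_spread(lambda: random.random() ** (1 / 3))

print(u_mean, u_min)  # the min has the smaller spread under uniform noise
print(h_mean, h_min)  # the mean has the smaller spread here
```

With these parameters the min beats the mean under uniform noise but loses once the noise density vanishes near 0, matching the order-statistic argument above.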
Certainly we could do many runs, draw the empirical error distribution, and
then tell which estimator is better in this particular case. But without good
knowledge of the error distribution, using the sample mean is a safe bet
because we know its variance is of order `O(1/n)`. If we want to be robust to
outliers, a common solution is to use the median, by a similar argument.
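A tiny example of the median's robustness; the timings below are made up for illustration (think of one run hit by GC or JIT warm-up among otherwise stable runs):

```python
import statistics

# Hypothetical timings (seconds) for one session, with a single outlier.
times = [1.02, 1.05, 1.01, 1.03, 1.04, 5.90]

print(statistics.mean(times))    # pulled well above the typical run by the outlier
print(statistics.median(times))  # stays near the typical running time
```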