Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/11018#issuecomment-178473958
@davies I think this depends on the error distribution.
Let's assume that the measured running time of algorithm A is described by
an additive model: a + W, where a is a constant indicating the ideal running
time and W is a *positive* random variable describing system noise/overhead.
We assume the same error distribution for algorithm B: b + W. Essentially, we
want to test which of a and b is smaller (faster). One common approach is to
compare the sample means of `a + W` and `b + W`; you propose comparing the
sample minima of `a + W` and `b + W`.
If we agree on this model, which method is better depends mainly on the
variance of the sample mean versus that of the sample min (the first order
statistic). We know that the variance of the sample mean is of order `O(1/n)`
(CLT), while the variance of the first order statistic is very sensitive to
the distribution of W. If W follows a uniform distribution, the variance of
the first order statistic is of order `O(1/n^2)`, which is indeed better than
that of the sample mean. However, if the error distribution has little mass
near 0, the variance of the first order statistic can be very large. This is
easy to verify numerically.
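Here is one way such a numerical check could look. This is an illustrative sketch, not code from the PR: the constants (a = 1, n = 30, 2000 sessions) and both noise laws are my own choices. Uniform noise has positive density at 0, so its min concentrates at rate `O(1/n)`; `W = U^(1/3)` has density `3w^2`, which vanishes at 0, so its min converges only at rate `O(n^(-1/3))` and the mean wins.

```python
import random
import statistics

# Monte Carlo sketch (illustrative setup): compare the spread of the sample
# mean vs. the sample min of a + W across repeated benchmarking sessions.

def estimator_spread(noise, n=30, sessions=2000, a=1.0):
    """Std. dev. of the sample mean and the sample min over many sessions."""
    means, mins = [], []
    for _ in range(sessions):
        times = [a + noise() for _ in range(n)]
        means.append(statistics.mean(times))
        mins.append(min(times))
    return statistics.stdev(means), statistics.stdev(mins)

random.seed(0)

# Uniform noise on [0, 1]: density positive at 0, min concentrates quickly.
u_mean, u_min = estimator_spread(random.random)

# Noise with little mass near 0: U^(1/3) sampled by inverse-CDF transform.
h_mean, h_min = estimator_spread(lambda: random.random() ** (1 / 3))

print(u_mean, u_min)  # the min has the smaller spread under uniform noise
print(h_mean, h_min)  # the mean has the smaller spread here
```

With these parameters the min beats the mean under uniform noise but loses once the noise density vanishes near 0, matching the order-statistic argument above.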
Certainly we could do many runs, draw the empirical error distribution, and
then tell which estimator is better in this particular case. But without good
knowledge of the error distribution, using the sample mean is a safe bet
because we know its variance is of order `O(1/n)`. If we want to be robust to
outliers, a common solution is to use the median, by a similar argument.
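A tiny example of the median's robustness; the timings below are made up for illustration (think of one run hit by GC or JIT warm-up among otherwise stable runs):

```python
import statistics

# Hypothetical timings (seconds) for one session, with a single outlier.
times = [1.02, 1.05, 1.01, 1.03, 1.04, 5.90]

print(statistics.mean(times))    # pulled well above the typical run by the outlier
print(statistics.median(times))  # stays near the typical running time
```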