Interesting. Then I'll restate things and try to answer your comments 
properly. Sorry, this will be a long post again :/ but I really need it 
to show that the CLT partly fails in this case.
 

*1- Restating things: running the function a single time*

I know this is not what you are interested in. But I hope I'll be clearer 
here.

Some functions may be called rarely yet take a substantial part of the 
whole calculation time. From what I see, people just repeat the calculation 
*n* times, take the arithmetic mean and choose whichever function gives the 
smallest number.

*Question:* which function of *f1* and *f2* is the fastest for a single 
calculation given that set of inputs on *that* computer?

*Method 1:* put @time in front of each of them and get timing results.
*Problem 1:* this is not the answer to the question above: 

   - Instead of trying to answer: *which function would always be faster 
   on my computer with that set of inputs?*
   - he answered: *which function was faster that time?*

*Problem 2:* Chances are high that "always" is too strong. "Generally" 
could be used instead.


*Method 2:*

> if you perform an operation n times, it will take about nμ seconds. If you 
> want your code to run as fast as possible that's probably what matters.
>
So, if I repeat the measurement *n* times and divide the resulting 
execution time by *n*, I should get a good estimate of the expected 
execution time.

*Problem:* "*generally*" is so vague a word that it is not enough! What 
does it mean? Pick your poison:

   - Is *f1* faster than *f2* in more than *X*% of the cases? (hint: the 
   arithmetic mean tells us *nothing* about that)
   - What is the most probable execution time? (hint: this is usually *not* 
   the arithmetic mean, e.g. it is the geometric mean for the lognormal 
   distribution)
   - What's the range including *X*% of the execution times for *f1*/*f2*? 
   (hint: its center is usually *not* the arithmetic mean)
   - etc.
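To make the difference between those "poisons" concrete, here is a quick 
sketch (in Python, with synthetic lognormal "timings", *not* the actual 
measurements from this thread) showing how much they can disagree with the 
arithmetic mean:

```python
# Synthetic lognormal "timings": each "poison" above gives a different
# answer, and none of them matches the arithmetic mean.
import numpy as np

rng = np.random.default_rng(0)
t = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # fake execution times

arith_mean = t.mean()                    # what people usually report
geo_mean = np.exp(np.log(t).mean())      # most probable region for a lognormal
median = np.median(t)                    # 50% of runs are faster than this
lo, hi = np.percentile(t, [5, 95])       # range holding 90% of the times
```

For a right-skewed sample like this one, the arithmetic mean sits well 
above both the median and the geometric mean, so it answers none of the 
questions above.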

So not everyone wants the same thing. Plus, if you calculate the arithmetic 
mean of some repetitions, you take into account effects that may slow down 
(GC, OS over time, etc.) or speed up (cache, buffers, etc.) the 
calculation, and get a completely wrong estimate. Moreover, the arithmetic 
mean alone is almost always not what you want and is never enough (at 
least provide the variance if you are sure the distribution is Gaussian). 
And as John emphasized, the mean or any other location estimator tells us 
nothing about the spread of the distribution.

*Method 3:* plot histograms/PDFs and let other people pick their own poison.
*Problem:* not a single number, but a complete distribution.
*Advantages:* everything is included and distribution metrics can be 
calculated from it; the spread is known; normality can actually be 
checked; if you observe so, you can tell that some function is *always* 
slower than another one; etc.
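For instance, once you keep the individual timings, answering "how often is 
*f1* faster than *f2*?" is a one-liner. A sketch with made-up lognormal 
timings (the distributions and parameters are purely illustrative):

```python
# With per-run timings you can directly estimate P(f1 faster than f2):
# no mean involved. t1 and t2 are made-up lognormal timings.
import numpy as np

rng = np.random.default_rng(1)
t1 = rng.lognormal(0.0, 0.5, size=10_000)  # fake timings for f1
t2 = rng.lognormal(0.2, 0.5, size=10_000)  # fake timings for f2 (slower on average)

# Fraction of independent run pairs where f1 beats f2:
p_f1_faster = (t1 < t2).mean()
```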




*2- How the CLT may sometimes partly fail*

Ok, let's state it now: the CLT is working quite well, it's just that some 
of our assumptions about it aren't. :)

In the world of technical computing, we are mostly interested in running 
> the same computationally intensive functions over and over, and we usually 
> only care about how long that will take.
>
comes down to:
*Question:* which function of *f1* and *f2* is best suited to put inside a 
tight loop?

You appropriately and correctly invoked the CLT:

> the central limit theorem guarantees that, given a function that takes a 
> mean time μ to run, the amount of time it takes to run that function n 
> times is approximately normally distributed with mean nμ for sufficiently 
> large n regardless of the distribution of the individual function 
> evaluation times as long as it has finite variance.
>

But let's split that a little bit and check it :)

*2.1- The central limit theorem*

The CLT is all about sums. If you draw *n* samples from the distribution 
*D* and sum them, you get one number *N1*. If you draw *n* new samples 
from the same distribution *D* and sum them, you get a new number *N2*, 
etc. Let's say you performed that operation *m* times. Then you have *m* 
sums of *n* elements from distribution *D*: *N1...Nm*.

The central limit theorem states that the distribution of the numbers 
*N1...Nm* converges to a normal distribution as *n* goes to infinity. It 
can be justified by convolution: the distribution of the *N1...Nm* 
corresponds to the self-convolution of the initial distribution *D*, *n* 
times. And it so happens (and has been proven) that any distribution with 
finite variance converges to a Gaussian as *n* tends to infinity.

Even more interesting, if you divide these sums *N1...Nm* by *n*, you get a 
mean whose variance is *always* smaller than the variance of the initial 
distribution *D*.

These two behaviors are the roots of the CLT.
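These two behaviors are easy to verify numerically. A small Python sketch, 
using an exponential distribution as a stand-in for a skewed *D* (nothing 
here comes from the actual measurements):

```python
# m sums N1...Nm of n draws each from a skewed distribution D:
# their distribution tends to a Gaussian and, once divided by n,
# their variance shrinks well below var(D).
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 5_000
D = rng.exponential(scale=1.0, size=(m, n))  # skewed stand-in "timings"

sums = D.sum(axis=1)  # N1...Nm
means = sums / n      # means of n draws; their variance is ~ var(D)/n
```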

*2.2- Wash, rinse, repeat, n times*

You are proposing to calculate *N1* only, *not* *N1...Nm*, *not* the 
individual elements of *N1*. The rationale behind this is that if *n* is 
large enough, the variance of the resulting self-convoluted distribution 
should be so small that it can be considered negligible.

That is very sensible. The thing is that, once again, you have no data to 
back that up with a single @time. A raw arithmetic mean really won't help 
you. This is usually fine for very small functions. But when you want to 
benchmark bigger ones, you will do fewer repetitions, and the approximation 
won't hold as well.
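To illustrate: from a single total you can recover the mean, but the 
standard error of that mean can only be estimated from the individual 
values (synthetic data again):

```python
# A single timing around the whole loop gives you only the total: you can
# derive the mean from it, but no error bar. Individual timings give both.
import numpy as np

rng = np.random.default_rng(5)
per_run = rng.lognormal(0.0, 1.0, size=1000)  # fake per-run timings

total = per_run.sum()                   # all a single loop timing gives you
mean_from_total = total / per_run.size  # an estimate with unknown error

mean = per_run.mean()
stderr = per_run.std(ddof=1) / np.sqrt(per_run.size)  # needs individual values
```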

*2.3- Total mean vs. complete CLT*

We can simulate a CLT calculation by considering that our 2020 points are 
actually a series of measurements of *m* groups of *n* repetitions:

<https://lh6.googleusercontent.com/-BJM_JDHrOF0/VBWCoYgvp1I/AAAAAAAAAEY/KU1RdU6NrRI/s1600/CLT0.png>
Here, as you can see, the CLT holds: from our complicated initial 
distribution, we get something which is more or less Gaussian, and as *n* 
increases the variance seems to drop. Good.

But what are those red dashes above? They represent the location of the 
arithmetic mean of the 2020 measurements. As you can see, in this case the 
global mean overestimates the most probable range of our distributions. To 
explain a bit further, the blue curve represents the distribution of the 
mean over 6 consecutive elements from the 2020-valued complete measurement. 
One can integrate this curve to obtain the cumulative distribution function 
(CDF). If we look where the arithmetic mean falls on these CDFs, we can 
show that in the n=6 case the mean corresponds to 80% probability (80% of 
measured times are below it), and this number rises to 90% for n=24. That 
shows that the arithmetic mean is not a good location estimator.
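You can check that behavior on any right-skewed sample. A sketch with a 
synthetic lognormal sample of the same size (2020 values), not the real 
timings:

```python
# Where the arithmetic mean lands on the empirical CDF of a skewed sample:
# well above the 50% point. Synthetic lognormal sample of 2020 values.
import numpy as np

rng = np.random.default_rng(3)
t = rng.lognormal(0.0, 1.0, size=2020)  # fake timings, same sample size

mean = t.mean()
frac_below = (t < mean).mean()  # empirical CDF evaluated at the mean
```

Here around two thirds of the runs finish faster than the "average" run.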

But how is that possible? 
*I'll answer in the following parts.*

BTW, is m=84 enough in the n=24 case?
*May I remind you that you said m=1 is enough? But I'll also elaborate on 
that.*

*2.4- Rebuilding the CLT from the initial distribution*

To explain that discrepancy, we can rebuild the CLT from the initial 
distribution to see it more clearly:

<https://lh4.googleusercontent.com/--_lPgdqUzUw/VBWCug4suEI/AAAAAAAAAEg/VRt1EdT99u0/s1600/CLT1.png>

Good news: we have something more centered on the arithmetic mean; the CLT 
is respected. Bad news: the distributions are completely and utterly wrong!

*2.5- What went wrong?*

Remember?

<https://lh6.googleusercontent.com/-R1m0ZbqDIKo/VBWC0aDzvQI/AAAAAAAAAEo/jHQBDjpbHBA/s1600/CLT2.png>

What it means is that we have periodic behavior. Order matters! So we 
cannot just construct the CLT distribution from the initial one while 
ignoring the order. We have to be smarter.

*2.6- Being smarter*

Here was the time vs. iteration plot:

<https://lh5.googleusercontent.com/-UlCyUjoPC1Y/VBWC31U4n9I/AAAAAAAAAEw/a0oFNNLbR1Y/s1600/CLT3.png>
These points are divided into 4 groups, clearly visible there. We already 
showed that "Family 0" occurs twice as much as the 3 other families 
together. The yellow circles emphasize a small bump (likely due to OS 
stuff). It's worth noting that my clustering algorithm was very dumb: 
while family 0 could be perfectly separated from family 1, some values 
that should be in family 2 were accounted for in family 3.

We can plot the PDFs corresponding to these curves:

<https://lh3.googleusercontent.com/-Hvp9tr2mlPg/VBWC7IIFj8I/AAAAAAAAAE4/yzcgafv_new/s1600/CLT4.png>
So, here, instead of considering the whole distribution and self-convolving 
it, we can divide it into 4 families to take ordering into account. Now 
let's look at the distribution of those levels. For instance, take n=6, 
which means we look at all the consecutive groups of 6 points in our data 
and count which point belongs to which family:

<https://lh5.googleusercontent.com/-_O62fSbNSR4/VBWDAb7GaSI/AAAAAAAAAFA/0JlxzX8X954/s1600/CLT5.png>
As you can see, by taking groups of n=6 iterations, the baseline (family 0) 
converges to an occurrence of 3/6 and the 3 others to roughly 1/3 (most of 
the family 2 -> 3 bumps are only due to my failure to provide a better 
clustering algorithm).

We are very, very lucky: since the occurrences of the 4 families are 
constant through the data, we don't have to perform multivariate kernel 
density estimation (yeah!). Let's take that into account and calculate the 
new distribution:

<https://lh3.googleusercontent.com/-D1V2ZSG3Iwk/VBWDC4SmEVI/AAAAAAAAAFI/XJgC7j4D238/s1600/CLT6.png>
Ah! That's *much* better (for the details, I used a Monte-Carlo algorithm, 
hence the values of *m*, the number of MC samples, which corresponds to 
the number of averaged values *N1...Nm*).
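For the record, the Monte-Carlo idea can be sketched as follows; the family 
levels and weights below are made up for illustration, *not* the fitted 
values from my data:

```python
# Monte-Carlo reconstruction sketch: draw each iteration's family with
# fixed occurrence weights, add a hypothetical time level per family,
# average over n iterations, and repeat m times to build the distribution
# of the mean.
import numpy as np

rng = np.random.default_rng(4)
weights = np.array([3, 1, 1, 1]) / 6     # family 0 is the baseline
levels = np.array([1.0, 1.3, 1.6, 1.9])  # made-up per-family times (a.u.)

n, m = 6, 10_000
fam = rng.choice(4, size=(m, n), p=weights)             # family of each iteration
times = levels[fam] + rng.normal(0, 0.02, size=(m, n))  # small per-run jitter
means = times.mean(axis=1)                              # the N1/n ... Nm/n
```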

Wonderful, now the mechanism is grasped. This is yet another advantage of 
gathering individual values: you can try to reconstruct the final 
distribution. If it doesn't work (as in the attempt of part 2.4), then 
something fishier is going on.


*2.7- What about just calculating N1 for very large n?*

This is what you proposed. This is also what triggered that second very 
long post. So let's see now if it is a good approximation:

<https://lh6.googleusercontent.com/-AXk1GbS9v-s/VBWMMJiMw3I/AAAAAAAAAFY/pdIVCXhbDJs/s1600/CLT7.png>
First check: the CLT is working *very well*. As *n* increases:

   - The distribution gets more and more Gaussian (the right-end tail tends 
   to disappear);
   - The variance is very clearly decreasing.

Now what do those curves mean? At least three things:

   1. If you just calculate *N1*, you are doing with your loops what you 
   reproach people for when they compare execution times using @time on one 
   run. The only difference is that as *n* gets larger, the variance gets 
   smaller. But, once again, if you don't measure individual values, you 
   can't calculate the variance, so you have absolutely no idea of your 
   error. The curves above are a really good example: the arithmetic mean 
   of my 2020-valued sample is located at 98% of the *n*=2020 distribution. 
   That means that the value I genuinely obtained happens to be far off the 
   real distribution mean.
   2. Another observation is even more interesting: the mode of each 
   distribution is not located at the same point. It even moves forward and 
   backward! Why is that? It's because of the coupling between asymmetry 
   *and* order. One consequence is that the execution time of the function 
   in your for loop depends on the number of loops. Hence my advice in my 
   first post: don't benchmark, profile instead, with real data.
   3. If you look at the probability corresponding to the calculated 
   arithmetic mean (red dashes) for each of these *n*, you get very 
   different results (from 78 to 98%). That means that the arithmetic mean 
   is not a good location estimator (just like John said about the median 
   for the normal distribution: it should correspond to the mean and the 
   mode of the distribution, but its variance is much larger than that of 
   the mean).

So the arithmetic mean is not always a good estimator and calculating *N1* 
instead of *N1...Nm* is usually not enough.

*3- Conclusion*


   - Whatever the case considered, getting individual execution times 
   instead of one unique sum of them all is almost always needed (at least 
   to calculate a variance).
   - Plotting distributions instead of giving mean + variance gives much 
   more useful information (is it always faster? more than 50% of the time? 
   etc.).
   - The CLT is about sums of sums (or means of means), not just about 
   means.
   - You can't determine that *n* is large enough just from the mean.
   - The arithmetic mean is a very bad location estimator if *n* is too 
   small (6000 values is still not enough in our case; would you calculate 
   6000 values with a function taking 1 h to execute?).

That said, the mean (just like the median for the normal distribution) is 
not *that* bad an estimator (it will rarely be completely off): the 
1e-5 s difference won't matter compared to the 5e-2 s difference between 
Julia and Octave in this case. 

However, if you don't estimate those errors, no one is likely to do it for 
you. And I can understand that formally calculating errors is a real pain. 
That's why I'm proposing to plot distributions/histograms instead. That 
way, you don't have to bother with the maths. You don't compute errors; 
you just see that one distribution is completely/mostly above/below/at the 
same place as the other.

 



Your effect from running the function many times is curious but it's not 
> reproducible for me. After running for 20 minutes I still just see two 
> bands. 
>
That's not really surprising. What I really see are two phenomena: GC and 
no-GC. The origin of the split is then not clear. But maybe if you tweak 
the matrix size, you may reproduce that behavior. Plus, if you look at the 
repetition, you should see that the pattern is quite different: it's the 
reverse of the original calculation pattern (two close bands at small 
execution times and two clearly separated bands at higher execution times).
 

> The difference between the means of the 2020 samples at the beginning, 
> middle, and end is <2%. Can you reproduce that?
>
More or less yes. But I already answered about that above.


Again, sorry for the long post.
Gaël
