A few days ago, Ján Dolinský posted some microbenchmarks
<https://groups.google.com/d/topic/julia-users/n3LfteWJAd4/discussion>
about Julia being slower than Octave for squaring quite large matrices. At
that time, I wanted to answer with the traditional *microbenchmarks are
evil*™ type of answer.
I wanted to emphasize that:
1. The mean is hardly a good distribution location estimator in the
general (non-normal) case.
2. Neglecting that, what he calculated was a typical execution time for
the squaring of *one and only one* 7000x7000 matrix of random Float64
numbers on his computer.
3. Assuming the 7000x7000 matrix size represents the typical matrix size
for his work, Julia is likely *fast enough*: the squaring runs below 1 s,
and the difference between Julia and Octave is not noticeable by anything
except a computer.
But from his later messages, I guess this was not really what he had in
mind (though I may have misinterpreted his words). I suppose that what he
really wanted was to compute *many* such squared matrices, so many of
them that the observed difference in execution times would matter a lot.
What would be a not-too-bad benchmark adapted to this use case? I guess
repeating the calculation is required. But instead of measuring time
outside of the loop, it is much more interesting to measure it inside the
loop, once per squaring. These individual measurements, if not too short,
would allow us:
- to plot the complete time distribution instead of assuming normality
and providing only one distribution parameter;
- to get the cumulative distribution to represent the case of multiple
consecutive matrix squarings;
- to follow the execution time evolution through time or iterations.
So here is the snippet of code that I used:
t = Float64[]
for i = 1:2020
    X = rand(1000, 1000)
    tic()
    XX = X.^2   # element-wise squaring, timed individually
    b = toq()
    push!(t, b)
end
The matrix generation is in-loop to avoid any smart caching.
*1- Distributions*
So here are the distributions (mind the Y axis is log):
<https://lh5.googleusercontent.com/-9FN8ZPU-0pc/VBMAGn_zEyI/AAAAAAAAABo/4LgTyyxCATk/s1600/1-Distributions.png>
Octave execution times are small and very close to each other. The
behavior of the two Julia repetitions of the same calculation is much
weirder: there is not just one peak, but 3 or 4. That's a strong hint
*against* mean calculations. Furthermore, complete distributions contain
much more information. I really think that computing such estimates of
the distributions is the bare minimum when benchmarking.
Oops, I just said it: *estimates*. The methods used to estimate
distributions usually need arbitrary parameters to be set (here, the
Gaussian kernel and its bandwidth). So the peaks above have an almost
arbitrary width. How do I know whether these peaks represent something
real or are just produced by a poor bandwidth choice? Repeating the whole
process is one good way to test that. But here, I'll take another
approach: plotting execution time vs. iteration number.
*2- Execution times vs. iteration number*
So let's look at these execution times:
<https://lh5.googleusercontent.com/-J67l7RmdAG0/VBMAKHyh4uI/AAAAAAAAABw/GwLg8oJUbk0/s1600/2-Execution%2Btimes%2Bvs.%2Biteration%2Bnumber.png>
Ohoh! (Not so) surprising! While Octave behaves as expected (random
scatter around a single value), Julia exhibits a completely different
profile: there are 4 distinct layers. Having done other repetitions, I
find Julia's baseline is always more or less the same, but there are
regular execution-time bumps now and then. And if you look very
carefully, you'll see that there is a bump around iteration #1550 in the
"Julia (rep.)" case. More on that later.
*3- Periodogram*
So I just wanted to have a closer look at these bumps. Do they always
occur with a given frequency? A good way to answer this question is to
plot the periodogram:
<https://lh5.googleusercontent.com/-bZxkeMbD3Ik/VBMARRz8SVI/AAAAAAAAAB4/VkXtW-q3CPI/s1600/3-Periodogram_0.png>
Julia's periodogram shows that the bumps mostly occur every 3 iterations.
Octave exhibits no such frequency dependency.
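For readers who want to reproduce this kind of check, here is a minimal pure-Python sketch on a synthetic trace (the numbers are made up: a bump every 3rd iteration stands in for the layered Julia profile above). A naive DFT power spectrum makes the period stand out.

```python
import cmath
import math

def power_spectrum(x):
    """Naive DFT power spectrum (O(n^2)); fine for short benchmark traces."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]      # drop the DC component
    return [abs(sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

# Synthetic trace: a fast baseline with a "bump" every 3rd iteration.
n = 90
times = [0.9 if i % 3 else 2.7 for i in range(n)]

spec = power_spectrum(times)
peak = max(range(1, len(spec)), key=lambda k: spec[k])
# The dominant bin is n/3: one bump every 3 iterations.
```

In a real analysis one would of course run this on the measured times `t`, and a proper FFT (or an existing periodogram routine) instead of this O(n^2) toy.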
From there, I don't really know how to go much further: profiling may
help, and since @time can distinguish between computation- and GC-related
execution times, I guess it wouldn't be so hard to look into this from a
GC perspective. This behaviour can also be clearly seen in some parts of
the time-vs-iteration graph:
<https://lh6.googleusercontent.com/-qssnaM10bXQ/VBMAUAzCBMI/AAAAAAAAACA/OA5GAJyPNj4/s1600/3-Periodogram_1.png>
How is that important? Things like periodicity can give very strong hints
about what is actually happening. Raw results are not enough and this is
one reason why most benchmarks are useless. The second reason is that there
is a discrepancy between the question asked and the answer calculated:
let's go forward to cumulative density functions.
*4- Cumulative density function of execution time*
The cumulative curve represents the total execution time elapsed after a
given number of iterations. This metric is more interesting than
averaging single measurements because it does not presuppose normality.
Here is an example:
<https://lh4.googleusercontent.com/-Gz16JMCGr5A/VBMAYEAs2ZI/AAAAAAAAACI/wjprIGyNuEk/s1600/4-Cumulative%2Bdensity%2Bfunction.png>
Dots are actual values. The red line corresponds to the estimated curve
for Julia in the case where the bumps are avoided. The cyan one
corresponds to the current case (bumps every 3 or so iterations). By
chance, these curves are almost linear, which means that taking the
average of many values is not so bad an approximation. However, no one
would ever have noticed that *most* (2/3) of the cases are computed quite
quickly while 1/3 of them take 3 times as long. And I repeat: by chance.
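The computation behind such a curve can be sketched in a few lines of Python on synthetic times (illustrative values only): the elapsed time after n iterations is just the running sum of the per-iteration measurements, and a single mean hides the fast/slow mixture entirely.

```python
from itertools import accumulate

# Synthetic per-iteration times: 2/3 fast cases, 1/3 taking 3x as long,
# mirroring the mixture described above.
times = [0.3 if i % 3 else 0.9 for i in range(900)]

# Elapsed time after n iterations: the running (cumulative) sum.
elapsed = list(accumulate(times))

mean = sum(times) / len(times)             # 0.5 for this mixture
predicted = [mean * (i + 1) for i in range(len(times))]
# The mean-based straight line ends at the same total as the true
# cumulative curve, but the 2/3-fast / 1/3-slow composition is invisible.
```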
*5- Conclusion*
Are we finished yet? Of course not. There are so many interactions that
it's really hard to get a useful insight. I've only scratched the surface.
Look at this for instance:
<https://lh6.googleusercontent.com/-3m3VKUBZ4qI/VBMAbWNpVKI/AAAAAAAAACQ/nPTc1TLvqXc/s1600/5-Conclusion.png>
This is a longer Julia run. I also recorded CPU and RAM usage at the
same time: no correlation at all. But there seems to be an OS- or
Julia-related long-term effect on the calculations. Transform that into a
cumulative curve and you'll see that our naive curve from part 4 is only
the tip of the iceberg:
<https://lh4.googleusercontent.com/-geSdEeNdIZc/VBMLLob-bGI/AAAAAAAAACc/-2kehcTrCgo/s1600/5-conclusion_2.png>
Depending on where you compute your average, you are likely to
overestimate or underestimate the actual execution time by a large
margin. This curve is an example of microbenchmarking failure. Benchmarking
is useless: profile complete programs instead.
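Here is a tiny Python sketch of that failure mode with a hypothetical slow drift (a sine wave standing in for the long-term effect; the actual cause in my runs is unknown): averages taken over different windows of the very same run disagree badly.

```python
import math

# Synthetic long run with a slow drift in per-iteration time.
n = 2000
times = [1.0 + 0.5 * math.sin(2 * math.pi * i / n) for i in range(n)]

first_half = sum(times[: n // 2]) / (n // 2)
second_half = sum(times[n // 2 :]) / (n - n // 2)
# Benchmark only the first half and you overestimate; only the second
# half and you underestimate, even though the whole-run mean is 1.0.
```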
My purpose was really not to prove anything performance-wise (both the
Julia and the Octave code were *not* optimized). This was not the point.
I just wanted:
1. to show that it's already very complicated to get meaningful results
on *one's* computer, in *one's* use case, for *one's* particular
calculation, even with all these efforts to set the problem up correctly;
2. to ask whether these periodic behaviours are known to and expected by
the Julia devs.
So, at least, as a simple reader of these forums, I would be very, very
pleased if we could provide at least histograms of execution-time
distributions instead of just giving one raw number. This alone won't
tell the whole story, but it can make a real difference.