There have been fewer complaints lately, and in some ways that's a sign of a
growing community, which is good, but resurrecting zombie threads with
irrelevant side points is a pet peeve
[of](https://forum.nim-lang.org/t/1263#10129)
[others](https://forum.nim-lang.org/t/21#4446). So, @jibal may want to work on
being less easily triggered, or else open new threads & link back in the future.
Since @jibal did resurrect this one to the top of the index _and_ follow on to
my attempt to appease him with (self-righteous|self-deprecating),
defending-the-innocent justification, I felt a reply was warranted, but will add
some hopefully interesting non-redundant information along the way.
The only thing I said that was incorrect in the paragraph that @jibal quoted was
to suggest `OP.fib(0)` is correct (even that was qualified by "I think" as in "I
suspect", and that slight error was already granted just above). Just from the
full text of the post @jibal quotes, I clearly explicitly took @miran's
`fib(2)` example to indicate an issue with index origin (meaning 0 1 1 2.. vs 1
1 2... or even 0 0 0 1 1 2...). `OP.fib(n)` is fine for `n >= 1`, matching the
"1 1 2 3 5.." variant after the very first slot in the F(0)=F(1)=1 case (vs
F(1)=F(2)=1 or whatever). "old school" may be wrong since Fibonacci himself
probably did F(1)=F(2)=1, but that's pretty vague. Maybe the "new school" is
the "old school", etc. It's a common enough gotcha that it is addressed right
at the beginning of the wiki page I linked to.
Anyway, while it's true the OP boundary condition was wrong, and `fib(0)`
should be 1 there, @miran's text suggests a different issue, since in 0-origin
terms Fib(2)=2 _as does the OP code_. Sure, @miran _may_ have meant _both_ the
missing 1 _and_ the indexing shift, but, given the literal text of his reply,
my quoted paragraph still seems appropriate. Many apologies to @miran if he
knows for sure that he was offended at the time, if he can even remember his
thoughts 2 years later - surely partly why zombie thread resurrection is
disliked.
Besides resurrecting to correct my "maybe-miscorrection", @jibal's point is
_also_ irrelevant to the main question (which @miran very likely understood
from his "btw", though more allowance for in-the-moment conversational flow is
very natural and I'm not criticizing that). The main topic of this thread was
_cross-backend comparisons_. You can call the 0 1 2 3 5.. series "Fibtastic" if
you want, but Fibtastic(46) still equals 0-index-origin Fibonacci(46) and still
had a performance mystery. { Fibtastic does one less level of recursion, making
it ~1.62x faster for large n, but that only matters for _cross-impl_ comparisons
if someone naively thought "fibtastic(46) == fibonacci0origin(46) => runtimes
should be equal". }
That returns us to the main response to the actual main question, to which I can
add some more information that may help someone, somewhere, someday. You
_really_ need call counts because the work done can vary **_a lot_** depending
on what backend compilers do, and that's become hard to predict/control. For
example, in my FP C program above (yes, it does 0 1 1 2 3 5.., so 46 -> 47,
depending on/if you care, and yeah, C code in a Nim forum, but the backend has
enough noise!), the time can vary tons from _many_ minor code
changes. On gcc-10.2 -O3 on Gentoo Linux x86_64, glibc-2.32-r1, compiled and run
as-is, it runs in 0.137 seconds now (i7-6700k at 4.7GHz), doing only 63041957
fibonacci calls. Changing the `printf` to `exit((int)fibonacci(46) & 255);`
makes it 2.3x slower, doing 148876861 calls (2.36x). Instead, add a global
`count` variable, `count+=1` at the top of `fibonacci`, & add it to the printf
in `NimMainInner` - 4.49 seconds (32.8x slower) and 1980780474 calls (
**_31.42x_** ). Etc., etc. Time ratios track call count well enough, but call
count varies a lot -- and it's not _just_ the C vs C++ backend and not just the
NimMainInner call structure. Many things (possibly like @adrianv's
`if`-reordering) can impact it. But what is it? I may not have explained it (or
even understood it) well in the past.
Disassembling NimMainInner (for the fast variant), you can see gcc generated
fib(42), fib(41), .. fib(36) or some such and does the (cheap) linear
combination. So, what gcc is really doing is **_"partial inlining of
recursion"_** (that's probably a better term than "unpacking" or "unrolling"..
Sorry!). In this case, that pays huge dividends because the recursion cost of
fib(42) is exponentially cheaper (base of golden ratio) than fib(46) and the
others cheaper still. gcc applies similar tactics inside the recursive
function, compounding the benefit. (You can do a similar thing just manually
expanding the recursion yourself with a Pascal's triangle.)
This partial inlining is distinct from the ["tail call" -> iteration style
transform](https://forum.nim-lang.org/t/2261#42062) (which, e.g., does not
change code size much based on the size of the loop) or the [remember the
"first 34"](https://forum.nim-lang.org/t/4253#26491) compile-time trick, which
uses more data state. A future compiler could surely inline even more levels
**_or_** compose the memoization with partial inlining to create >>30x work
variations, although 1.5 orders of magnitude is already a lot! These tricks
also do not rely upon FP arithmetic or vectorization, though that helps gcc for
me. So, this "benchmark" will only become more fraught with peril in the future
which would not be so bad except that it's so darn popular. In its 2nd life,
maybe it can become a measure of these tricks not "basic language costs" as it
tends to be now...
The optimization also _might_ apply to other recursive call scenarios that
don't have the hyper-efficient alternatives Fibonacci does, although rarely,
given how finicky it seems today. What has kept it so finicky? I don't know.
Ask the gcc maintainers (or the clang people why they can't do it, if that's
even still the case). I was honestly a little surprised the copy-paste of years
old code worked for me here.
Now, maybe @adrianv's results came from something else. I never said I knew for
sure, just that this was the best first thing to check. I could only say so
much with so little info and ability to reproduce. The initial ratios, both my
& @miran's inability to get the same 8.5x ratio, and @adrianv's last-post
timing variations all smell like this to me. I stand by my main advice to just
**_Always Measure Calls As Well as Time_**. That's literally the most (only??)
important point in this whole thread that deserves repeating far & wide. (It's
also a good exercise to learn to compile the Nim generated C code manually to
do PGO builds anyway.) @adrianv, _if_ he's even still around Nim/the
forum/etc., may well _no longer even have access_ to the same build environment
(none of which he told us) to reproduce his found 8.5x integer code effect -
**_yet another_** reason to dislike zombie threads. I know I sure don't have
easy access to the ancient gcc-5 setup with which I first found the FP behavior
myself.
With this I've said about as much on this topic as I likely will ever care to
in one convenient link-back. Cheers to all & keep the threads fresh! ;-)