This recursion unpacking/unrolling trick that gcc does (at call-site if 
insulated by call via volatile function ptr, and always inside the recursive 
impl) is, in my experience, a rare compiler optimization, but maybe it will 
catch on. clang does _neither_. If you `objdump -D` the executable (or 
disassemble in `gdb`/etc.) you will see the single `callq` to Fibonacci at the 
entry with the full N and a pair of `callq` inside the impl. So with clang the 
full [1.618**n 
work](https://stackoverflow.com/questions/360748/computational-complexity-of-fibonacci-sequence)
 happens. On my i7-6700K Linux with gcc-7.1.0&clang-4.0.1, I get a time ratio 
of about 15.3 to 1 (53.5s/3.5s).

For what it's worth, it's likely the compiler writer guys think of this as 
mostly a "remove a few funcalls here and there" type optimization..maybe use 
SSE/AVX/branch predictor a bit better, maybe save a little dynamic stack usage, 
etc. This doubly recursive Fibonacci approach just happens to be about as 
sensitive as possible to that particular optimization because it can save 
exponentially many funcalls.

Reply via email to