This recursion unpacking/unrolling trick that gcc does (at call-site if insulated by call via volatile function ptr, and always inside the recursive impl) is, in my experience, a rare compiler optimization, but maybe it will catch on. clang does _neither_. If you `objdump -D` the executable (or disassemble in `gdb`/etc.) you will see the single `callq` to Fibonacci at the entry with the full N and a pair of `callq` inside the impl. So with clang the full [1.618**n work](https://stackoverflow.com/questions/360748/computational-complexity-of-fibonacci-sequence) happens. On my i7-6700K Linux with gcc-7.1.0&clang-4.0.1, I get a time ratio of about 15.3 to 1 (53.5s/3.5s).
For what it's worth, it's likely the compiler writer guys think of this as mostly a "remove a few funcalls here and there" type optimization..maybe use SSE/AVX/branch predictor a bit better, maybe save a little dynamic stack usage, etc. This doubly recursive Fibonacci approach just happens to be about as sensitive as possible to that particular optimization because it can save exponentially many funcalls.
