I don't like to invoke the black magic card here. I have tried benchmarking in different ways in different scenarios and the results are consistent. They are also reproducible by others.
FWIW, this is what led me to this: https://github.com/JuliaDiff/ForwardDiff.jl/issues/57