On OS X using LDC and clang, I get ~6s for C++ and ~7s for D. The
slowdown in D seems to come from parts of nextPermutation not
being inlined.
Be careful with benchmarks like this: you are giving the
compiler a lot more information than it usually has in any
real-world case (here it knows the exact values of all the input
data and parameters!).
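To illustrate the point, here is a minimal C++ sketch of one way to keep the
input hidden from the optimizer: take the problem size from the command line
instead of hard-coding it, and return a checksum so the loop can't be thrown
away as dead code. The names permutationChecksum and runFromArgs and the
default size of 10 are made up for this example; they are not from the
benchmark being discussed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the original benchmark): walks every
// permutation of 0..n-1 and folds each one into a checksum, so the
// result depends on the work actually being done.
std::uint64_t permutationChecksum(std::size_t n) {
    if (n == 0) return 0;  // nothing to permute
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);  // v = 0, 1, ..., n-1 (sorted start)
    std::uint64_t sum = 0;
    do {
        sum += static_cast<std::uint64_t>(v.front()) * 31u
             + static_cast<std::uint64_t>(v.back());
    } while (std::next_permutation(v.begin(), v.end()));
    return sum;
}

// Hypothetical entry point: the size comes from argv, so its value is
// only known at run time and cannot be constant-folded at compile time.
std::uint64_t runFromArgs(int argc, char** argv) {
    std::size_t n = 10;  // assumed default, used only without an argument
    if (argc > 1) {
        n = static_cast<std::size_t>(std::strtoul(argv[1], nullptr, 10));
    }
    return permutationChecksum(n);
}
```

Calling runFromArgs(argc, argv) from main and printing the result keeps both
the input and the output observable. Timings from a version like this are
usually closer to what you would see in real code than timings from a version
with everything baked in as constants.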
I've compiled and run the code on Win7 x64. So I think it's a
compiler issue, with DMD causing the same slow behaviour on
Linux x64. If so, I should probably switch to another compiler,
e.g. LDC.