Thank you for your surprising timings, Jerry.

> All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2.  I built
> with both D and C++ enabled so the back end would be the same.

Is your system a 64 bit one?


> So, the upshot seems like DMD and GDC generate similar code for this test.

This is an uncommon thing, especially on 32 bit systems.


> And both D compilers generate slightly faster code than the C++
> version, therefore the D front end is doing a slightly better
> optimization job, or your first version is slightly more efficient code.

D1 code is often a bit slower than similar C++ code, but in this case I think 
D2 has allowed me to specify more semantics, and this has produced a faster 
program. The static foreach I have used in that D2 code doesn't just look 
cleaner than those C++0x template tricks, its assembly output is better too.

And the first D2 program is not even the fastest possible: today that second D2 
program is slow, but it contains some more semantics that hopefully someday 
will allow it to be faster than the first one, and about as fast as that 
Fortran version.

This code currently doesn't compile (no vector ^^, no vector sqrt, no good 
sum function):

immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));


It is currently implemented like this:

double[NPAIR] distance = void;
foreach (i; Iota!(0, NPAIR))
    distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);
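
For readers who want to try that loop, here is a minimal self-contained sketch. 
It makes several assumptions: the Iota template below is only a stand-in I 
wrote for the one used in the program (built on std.meta.AliasSeq, which newer 
Phobos versions provide), NPAIR = 10 is taken from the "10 sqrt(double)" 
mentioned below, r is assumed to be a double[3][NPAIR], and the function name 
distances is made up for the example.

```d
import std.math;             // sqrt, and support for the ^^ operator
import std.meta : AliasSeq;  // assumption: a Phobos version that has std.meta

// Stand-in for the Iota template of the post (assumed definition):
// expands to the compile-time sequence start, start+1, ..., stop-1.
template Iota(size_t start, size_t stop)
{
    static if (start >= stop)
        alias Iota = AliasSeq!();
    else
        alias Iota = AliasSeq!(start, Iota!(start + 1, stop));
}

enum size_t NPAIR = 10; // assumed value

// Hypothetical wrapper: Euclidean length of each of the NPAIR 3D vectors.
double[NPAIR] distances(ref const(double[3][NPAIR]) r)
{
    double[NPAIR] distance = void; // skip default initialization
    // foreach over a compile-time sequence is unrolled by the compiler,
    // so every i below is a constant index.
    foreach (i; Iota!(0, NPAIR))
        distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);
    return distance;
}
```

Because the loop is fully unrolled, the compiler sees 10 independent square 
roots in straight-line code, which is what gives it the chance to pair them.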


Here those square roots are parallelizable: the compiler is allowed to use the 
SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions. 
With the ymm registers of AVX the instruction VSQRTPD (intrinsic 
_mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But 
maybe the array's starting location needs to be aligned to 16 bytes (not 
currently supported syntax):

align(16) immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

Bye,
bearophile
