> Here those square roots are parallelizable; the compiler is allowed to use an 
> SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 
> instructions. With the ymm registers of AVX the instruction VSQRTPD (intrinsic 
> _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But 
> maybe its starting location needs to be aligned to 16 bytes (not currently 
> supported syntax):

This is the 32-bit assembly the Intel Fortran compiler produces for that code; 
it's heavily optimized and fully inlined:
http://codepad.org/h1ilZWVu

It uses only scalar square roots (sqrtsd), not the packed sqrtpd, so the 
performance improvement has other causes that I don't know. This also probably 
means the Fortran version is not the fastest version possible.
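
To make the sqrtsd/sqrtpd difference concrete, here is a minimal C sketch of the packed form using the SSE2 intrinsics. The function name sqrt_array is hypothetical (it is not from the benchmark or from the Fortran code), and unaligned loads are used so that no 16-byte alignment is assumed. On a target without SSE2 it falls back to plain sqrt():

```c
#include <math.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_sqrt_pd, _mm_sqrt_sd, ... */
#endif

/* Hypothetical helper: stores sqrt(src[i]) into dst[i] for i in [0, n). */
void sqrt_array(const double *src, double *dst, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Packed path: one sqrtpd per pair of doubles.
       loadu/storeu are used, so src and dst need no special alignment. */
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);
        _mm_storeu_pd(dst + i, _mm_sqrt_pd(v));
    }
    /* Scalar tail: one sqrtsd for a leftover element when n is odd. */
    for (; i < n; ++i) {
        __m128d v = _mm_load_sd(src + i);
        _mm_store_sd(dst + i, _mm_sqrt_sd(v, v));
    }
#else
    /* Portable fallback when SSE2 is not available. */
    for (; i < n; ++i)
        dst[i] = sqrt(src[i]);
#endif
}
```

With n = 10 the packed loop body runs five times, which is the "10 square roots with 5 instructions" case from the quoted text. Whether this is actually faster than ten sqrtsd depends on the CPU, so it should be measured rather than assumed.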

Bye,
bearophile
