> Here those square roots are parallelizable; the compiler is allowed to use an
> SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5
> instructions. With the ymm registers of AVX, the instruction VSQRTPD (intrinsic
> _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But
> maybe its starting location needs to be aligned to 16 bytes (not currently
> supported syntax):
The 32-bit assembly that the Intel Fortran compiler produces for that code is heavily optimized and fully inlined:
http://codepad.org/h1ilZWVu

It uses only serial square roots (sqrtsd), so the performance improvement must have other causes, which I don't know. This also probably means the Fortran version is not the fastest version possible.

Bye,
bearophile