On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
BTW: -march=native automatically implies -mtune=native

Thanks, I'll remove mtune)

It would be really interesting if you could try writing the same code in C, both a scalar version and a version using gcc's vector intrinsics, to allow us to compare performance and identify areas for D to improve.

I am lazy )

I have looked at the generated assembler code:

float, scalar (main loop):
        vmovss  xmm1, DWORD PTR [rsi+rax*4]
        vfmadd231ss     xmm0, xmm1, DWORD PTR [rcx+rax*4]
        add     rax, 1
        cmp     rax, rdi
        jne     .L191
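
For readers who do not read assembly: the loop above loads one float, multiplies it by the matching float from a second array, and adds the product into a single accumulator (xmm0). A rough C sketch of that shape follows. It is an assumption read off the listing, not the original D source, and the name dot_scalar is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: one multiply-add per iteration, one accumulator.
 * With -march=native on an FMA-capable CPU, GCC may contract the
 * multiply and add into a single fused instruction such as vfmadd231ss. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float acc = 0.0f;                 /* plays the role of xmm0 */
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];           /* load, multiply, accumulate */
    return acc;
}
```

Every iteration depends on the previous value of acc, so the loop is limited by the latency of the multiply-add rather than by its throughput.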

float, vector (main loop):
        vmovups ymm5, YMMWORD PTR [rax]
        sub     rax, -128
        sub     r11, -128
        vmovups ymm4, YMMWORD PTR [r11-128]
        vmovups ymm6, YMMWORD PTR [rax-96]
        vmovups ymm7, YMMWORD PTR [r11-96]
        vfmadd231ps     ymm3, ymm5, ymm4
        vmovups ymm8, YMMWORD PTR [rax-64]
        vmovups ymm9, YMMWORD PTR [r11-64]
        vfmadd231ps     ymm0, ymm6, ymm7
        vmovups ymm10, YMMWORD PTR [rax-32]
        vmovups ymm11, YMMWORD PTR [r11-32]
        cmp     rdi, rax
        vfmadd231ps     ymm2, ymm8, ymm9
        vfmadd231ps     ymm1, ymm10, ymm11
        ja      .L2448

float, vector (full):

It is pretty well optimized)

Best regards

