https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445
--- Comment #7 from Thiago Macieira <thiago at kde dot org> ---
Comment on attachment 45800
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45800
gcc9-pr89445.patch

Tested and works on my machine. The movzbl that GCC 8 generated is also
gone, but GCC now inserts moves *from* the opmask register:

.L4:
        movq    %rcx, %rax
        addq    $64, %rcx
        cmpq    %rdi, %rcx
        kmovw   %k1, %r9d
        cmova   %r8d, %r9d
        kmovw   %r9d, %k1
        vmovupd (%rsi,%rax), %zmm1{%k1}{z}
        addq    %rdx, %rax
        vmovupd (%rax), %zmm2{%k1}{z}
        vfmadd132pd     %zmm0, %zmm2, %zmm1
        vmovupd %zmm1, (%rax){%k1}
        cmpq    %rdi, %rcx
        jb      .L4

It seems the compiler forgot which GPR used to contain the mask, so it
has to reload it from %k1 before the cmova and then move it back into
%k1. The loop-end detection is also slightly worse.

Yesterday, when I benchmarked with GCC 8, 1000 iterations over 10
million doubles ran in roughly 11.9 ms, with 10 million instructions.
Today I am getting 11.8 ms at 16 million instructions (the increase in
instructions/cycle roughly matches the increase in instructions per
iteration, proving that memory bandwidth is the bottleneck).