https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445
--- Comment #7 from Thiago Macieira <thiago at kde dot org> ---
Comment on attachment 45800
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45800
gcc9-pr89445.patch

Tested and works on my machine. The movzbl that GCC 8 generated is also
gone, but GCC now inserts moves *from* the opmask register:

.L4:
        movq    %rcx, %rax
        addq    $64, %rcx
        cmpq    %rdi, %rcx
        kmovw   %k1, %r9d
        cmova   %r8d, %r9d
        kmovw   %r9d, %k1
        vmovupd (%rsi,%rax), %zmm1{%k1}{z}
        addq    %rdx, %rax
        vmovupd (%rax), %zmm2{%k1}{z}
        vfmadd132pd     %zmm0, %zmm2, %zmm1
        vmovupd %zmm1, (%rax){%k1}
        cmpq    %rdi, %rcx
        jb      .L4

It seems the compiler forgot which GPR used to contain the mask, so it
has to reload it from %k1 before the cmova and then move it back into
%k1. The loop-end detection is also slightly worse.

Yesterday, when I benchmarked with GCC 8, 1000 iterations over 10
million doubles ran in roughly 11.9 ms, with 10 million instructions.
Today I am getting 11.8 ms at 16 million instructions (the increase in
instructions/cycle roughly matches the increase in instructions per
iteration, proving that memory bandwidth is the bottleneck).