https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82139

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #1)
> It is worse on the trunk:
> .L2:
>         movdqu  (%rdi), %xmm1
>         movdqu  (%rdi), %xmm0
>         addq    $16, %rdi
>         paddd   %xmm3, %xmm1
>         paddd   %xmm2, %xmm0
>         blendpd $2, %xmm0, %xmm1
>         movups  %xmm1, -16(%rdi)
>         cmpq    %rdi, %rax
>         jne     .L2
> 
> Why two loads from %rdi here?
> This is done during RA as far as I can tell.

It looks like the generic cost model should be updated.

With -O2 -msse4 -mno-avx -mtune=skylake the generated loop looks optimal:

.L2:
  movdqu (%rdi), %xmm0
  movdqa %xmm3, %xmm1
  paddd %xmm0, %xmm1
  paddd %xmm2, %xmm0
  blendpd $2, %xmm0, %xmm1
  movups %xmm1, (%rdi)
  addq $16, %rdi
  cmpq %rdi, %rax
  jne .L2
