https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

            Bug ID: 82731
           Summary: _mm256_set_epi8(array[offset[0]], array[offset[1]],
                    ...) byte gather makes slow code, trying to
                    zero-extend all the uint16_t offsets first and
                    spilling them.
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include "immintrin.h"
#include "inttypes.h"

__m256i gather(char *array, uint16_t *offset) {

  return _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]],
array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]],
array[offset[7]],
      array[offset[8]],array[offset[9]],array[offset[10]],array[offset[11]],
array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]], 
      array[offset[16]],array[offset[17]], array[offset[18]],
array[offset[19]], array[offset[20]], array[offset[21]], array[offset[22]],
array[offset[23]], 
      array[offset[24]],array[offset[25]],array[offset[26]], array[offset[27]],
array[offset[28]], array[offset[29]], array[offset[30]],array[offset[31]]);
}

https://stackoverflow.com/questions/46881656/avx2-byte-gather-with-uint16-indices-into-a-m256i

https://godbolt.org/g/LEVVwt


        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        andq    $-32, %rsp
        subq    $40, %rsp
        movzwl  40(%rsi), %eax
        ...     # more movzwl
        movq    %rax, 32(%rsp)   # spill
        movzwl  38(%rsi), %eax   # and reuse
        ...     # more movzwl
        movzwl  46(%rsi), %r8d
        movq    %rax, 24(%rsp)   # spill
        movzwl  36(%rsi), %eax
        movzwl  42(%rsi), %edx
        movq    %rax, 16(%rsp)
        movzwl  34(%rsi), %eax
        ...

        ...
        vpinsrb $1, (%rdi,%r9), %xmm6, %xmm6
        vpinsrb $1, (%rdi,%rcx), %xmm5, %xmm5
        movq    24(%rsp), %rcx          # more reloading
        vpunpcklwd      %xmm6, %xmm3, %xmm3
        movzbl  (%rdi,%rcx), %edx       # and using as a gather index
        movq    8(%rsp), %rcx
        vpunpcklwd      %xmm5, %xmm1, %xmm1
        vpunpckldq      %xmm3, %xmm2, %xmm2
        vmovd   %edx, %xmm0
        movzbl  (%rdi,%rcx), %edx
        vpinsrb $1, (%rdi,%rbx), %xmm0, %xmm0

I think gcc is missing the point of vpinsrb: it builds too many separate dep
chains which it then has to shuffle together.  vpinsrb doesn't have such high
throughput on any CPU that you'd need more than 2 or 3 dep chains to saturate
its 1 or 2 per clock throughput.
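
A minimal sketch of what I mean (illustrative only, not tuned): just two
vpinsrb chains, one per 128-bit half, joined at the end with vinserti128.
The macro is only there because _mm_insert_epi8 needs an immediate index,
and note this gives bytes in memory order (result byte i = array[offset[i]]),
unlike the set_epi8 order above.

#include <immintrin.h>
#include <stdint.h>

__m256i gather_two_chains(const char *array, const uint16_t *offset) {
  /* start each chain with a zeroing load of element 0 / element 16 */
  __m128i lo = _mm_cvtsi32_si128((uint8_t)array[offset[0]]);
  __m128i hi = _mm_cvtsi32_si128((uint8_t)array[offset[16]]);
#define INS(v, base, i) v = _mm_insert_epi8(v, array[offset[(base)+(i)]], (i))
  INS(lo,0,1);  INS(lo,0,2);  INS(lo,0,3);  INS(lo,0,4);  INS(lo,0,5);
  INS(lo,0,6);  INS(lo,0,7);  INS(lo,0,8);  INS(lo,0,9);  INS(lo,0,10);
  INS(lo,0,11); INS(lo,0,12); INS(lo,0,13); INS(lo,0,14); INS(lo,0,15);
  INS(hi,16,1);  INS(hi,16,2);  INS(hi,16,3);  INS(hi,16,4);  INS(hi,16,5);
  INS(hi,16,6);  INS(hi,16,7);  INS(hi,16,8);  INS(hi,16,9);  INS(hi,16,10);
  INS(hi,16,11); INS(hi,16,12); INS(hi,16,13); INS(hi,16,14); INS(hi,16,15);
#undef INS
  /* one shuffle to join the two independent chains */
  return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}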

But the main problem here is that gcc does all the zero-extension of
offset[0..31] before doing *any* of the loads from array[], so it runs out of
registers and spills.
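
The point is clearer from the trivial scalar version (sketch; bytes in memory
order): each zero-extending index load immediately feeds the byte load it's
for, so only one scratch register is live at a time.

#include <immintrin.h>
#include <stdint.h>

__m256i gather_scalar(const char *array, const uint16_t *offset) {
  char buf[32];
  for (int i = 0; i < 32; i++)
    buf[i] = array[offset[i]];   /* movzwl the index, movzbl the byte, store */
  return _mm256_loadu_si256((const __m256i *)buf);
}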

See also the discussion on that SO question about byte gathers, and the
possibility that VPGATHERDD is worth it on Skylake.
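
As a rough sketch of that idea (my guess at the shape, not tuned): zero-extend
8 indices at a time with vpmovzxwd, VPGATHERDD with scale 1, mask each dword to
its low byte, pack back down, and fix the in-lane pack interleaving with a
final vpermd.  Caveat: each gather loads a whole dword at array+offset[i], so
it can read up to 3 bytes past the furthest indexed element, and the result is
in memory order (byte i = array[offset[i]]).

#include <immintrin.h>
#include <stdint.h>

__m256i gather_vpgatherdd(const char *array, const uint16_t *offset) {
  const __m256i bytemask = _mm256_set1_epi32(0xFF);
  __m256i g[4];
  for (int k = 0; k < 4; k++) {
    __m128i idx16 = _mm_loadu_si128((const __m128i *)(offset + 8*k));
    __m256i idx32 = _mm256_cvtepu16_epi32(idx16);      /* vpmovzxwd */
    g[k] = _mm256_and_si256(_mm256_i32gather_epi32((const int *)array,
                                                   idx32, 1), bytemask);
  }
  /* the packs work within 128-bit lanes, so un-interleave with vpermd */
  __m256i w01 = _mm256_packus_epi32(g[0], g[1]);
  __m256i w23 = _mm256_packus_epi32(g[2], g[3]);
  __m256i b   = _mm256_packus_epi16(w01, w23);
  return _mm256_permutevar8x32_epi32(b, _mm256_setr_epi32(0,4,1,5,2,6,3,7));
}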
