https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731
Bug ID: 82731
Summary: _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
#include <inttypes.h>

__m256i gather(char *array, uint16_t *offset) {
    return _mm256_set_epi8(
        array[offset[0]],  array[offset[1]],  array[offset[2]],  array[offset[3]],
        array[offset[4]],  array[offset[5]],  array[offset[6]],  array[offset[7]],
        array[offset[8]],  array[offset[9]],  array[offset[10]], array[offset[11]],
        array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]],
        array[offset[16]], array[offset[17]], array[offset[18]], array[offset[19]],
        array[offset[20]], array[offset[21]], array[offset[22]], array[offset[23]],
        array[offset[24]], array[offset[25]], array[offset[26]], array[offset[27]],
        array[offset[28]], array[offset[29]], array[offset[30]], array[offset[31]]);
}

https://stackoverflow.com/questions/46881656/avx2-byte-gather-with-uint16-indices-into-a-m256i
https://godbolt.org/g/LEVVwt

        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        andq    $-32, %rsp
        subq    $40, %rsp
        movzwl  40(%rsi), %eax
        ...                       # more movzwl
        movq    %rax, 32(%rsp)    # spill
        movzwl  38(%rsi), %eax    # and reuse
        ...                       # more movzwl
        movzwl  46(%rsi), %r8d
        movq    %rax, 24(%rsp)    # spill
        movzwl  36(%rsi), %eax
        movzwl  42(%rsi), %edx
        movq    %rax, 16(%rsp)
        movzwl  34(%rsi), %eax
        ...
        ...
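For contrast, here is one hand-written workaround (my sketch, not from the report; the name `gather_membuf` is hypothetical): gather the bytes through a small buffer with scalar stores, then do a single 32-byte vector load. Each zero-extending index load is consumed immediately, so the compiler never needs all 32 indices live at once. It assumes compilation with AVX2 enabled (e.g. -mavx2, or the target pragma below on GCC/clang).

```c
#include <immintrin.h>
#include <stdint.h>

#pragma GCC target("avx2")

/* Hypothetical workaround: scalar byte gathers into a buffer, then one
 * vector load.  Each movzwl result is used right away, so there is no
 * pressure to keep 32 zero-extended offsets in registers. */
__m256i gather_membuf(const char *array, const uint16_t *offset)
{
    unsigned char buf[32];
    /* _mm256_set_epi8 lists the highest byte first, so vector byte i of
     * the original function is array[offset[31 - i]]. */
    for (int i = 0; i < 32; ++i)
        buf[i] = (unsigned char)array[offset[31 - i]];
    return _mm256_loadu_si256((const __m256i *)buf);
}
```

The 32-byte reload spans all 32 narrow stores, so it likely eats a store-forwarding stall, but that is a one-time latency cost rather than the throughput and spill problem shown above.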
        vpinsrb    $1, (%rdi,%r9), %xmm6, %xmm6
        vpinsrb    $1, (%rdi,%rcx), %xmm5, %xmm5
        movq       24(%rsp), %rcx        # more reloading
        vpunpcklwd %xmm6, %xmm3, %xmm3
        movzbl     (%rdi,%rcx), %edx     # and using as a gather index
        movq       8(%rsp), %rcx
        vpunpcklwd %xmm5, %xmm1, %xmm1
        vpunpckldq %xmm3, %xmm2, %xmm2
        vmovd      %edx, %xmm0
        movzbl     (%rdi,%rcx), %edx
        vpinsrb    $1, (%rdi,%rbx), %xmm0, %xmm0

I think gcc is missing the point of vpinsrb, and is creating too many separate dependency chains which it then has to shuffle together. vpinsrb doesn't have good enough throughput on any CPU for that to pay off: 2 or 3 dependency chains are enough to saturate its 1 or 2 per clock throughput. But the main problem here is doing all the zero-extension of offset[0..31] before doing *any* of the loads from array[], running out of registers and spilling. See also the discussion on that SO question about byte gathers, and the possibility that VPGATHERDD might be worth it on Skylake.
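The VPGATHERDD idea mentioned above could look something like this (my sketch, not code from the report; the name `gather8_dword` is hypothetical): zero-extend 8 of the uint16_t offsets to dwords, gather one dword per index with scale 1, and mask to the low byte. Four such gathers plus a pack/permute sequence would build the full 32-byte result. It assumes AVX2 is enabled at compile time.

```c
#include <immintrin.h>
#include <stdint.h>

#pragma GCC target("avx2")

/* Gather 8 bytes via vpgatherdd.  Each gather element loads 4 bytes at
 * base + offset[i], i.e. 3 bytes past the byte we want, so the caller
 * must guarantee offset[i] + 4 does not run off the end of the array. */
__m256i gather8_dword(const int *base, const uint16_t *offset)
{
    __m128i idx16  = _mm_loadu_si128((const __m128i *)offset); /* 8 x u16   */
    __m256i idx32  = _mm256_cvtepu16_epi32(idx16);             /* vpmovzxwd */
    __m256i dwords = _mm256_i32gather_epi32(base, idx32, 1);   /* vpgatherdd,
                                                                  scale=1: byte offsets */
    return _mm256_and_si256(dwords, _mm256_set1_epi32(0xff));  /* keep low byte */
}
```

Whether this beats 32 scalar loads depends on the gather throughput of the target; on Skylake vpgatherdd is considerably faster than on Haswell/Broadwell, which is why the SO discussion singles it out.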