https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602
--- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Christoph Diegelmann from comment #0)
> GCC misses an optimization on this:
>
> #include <cstdint>
> #include "immintrin.h"
>
> void test(std::uint16_t* mask, std::uint16_t* data) {
>   for (int i = 0; i < 1024; ++i) {
>     *data = 0;
>     unsigned tmp = *mask++;
>     unsigned step = _mm_popcnt_u32(tmp);
>     data += step;
>   }
> }
>
> g++ -O3 -Wall -std=c++14 -march=skylake generates:
>
> test(unsigned short*, unsigned short*):
>         leaq    2048(%rdi), %rdx
> .L2:
>         xorl    %eax, %eax
>         addq    $2, %rdi
>         movw    %ax, (%rsi)
>         popcntw -2(%rdi), %ax
>         movzwl  %ax, %eax
>         leaq    (%rsi,%rax,2), %rsi
>         cmpq    %rdx, %rdi
>         jne     .L2
>         ret
>
> The rax register is known to be zero at the time of `popcntw -2(%rdi), %ax`.
> Anyway gcc still clears the upper bits using `movzwl %ax, %eax` afterwards.

The "xorl %eax, %eax; movw %ax, (%rsi)" pair is just an optimized way to implement "movw $0, (%rsi)". It just happens that the peephole pass found the unused %eax as an empty temporary register when splitting a direct move of an immediate to memory.

> While clang uses 32 bit popcnt and `movzwl (%rdi,%rax,2), %ecx` it correctly
> recognises that there's no need to clear the upper bits.
>
> clang -O3 -Wall -std=c++14 -march=skylake -fno-unroll-loops generates:
>
> test(unsigned short*, unsigned short*):
>         xorl    %eax, %eax
> .LBB0_1:
>         movw    $0, (%rsi)
>         movzwl  (%rdi,%rax,2), %ecx
>         popcntl %ecx, %ecx
>         leaq    (%rsi,%rcx,2), %rsi
>         addq    $1, %rax
>         cmpl    $1024, %eax             # imm = 0x400
>         jne     .LBB0_1
>         retq

popcntl has a false dependency on its output register in certain situations, whereas popcntw doesn't have this limitation. So, GCC chose this approach for a reason.