https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102487
Bug ID: 102487
Summary: __builtin_popcount(y&3) is not optimized to (y&1)+((y&2)>>1) if the target does not have a popcount optab (or only an expensive one)
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-*

Take:

int f(unsigned y)
{
  return __builtin_popcount(y&3);
}

On x86_64 (without the popcount optab enabled) this should be optimized to just:

        movl    %edi, %eax
        shrl    %edi
        andl    $1, %edi
        andl    $1, %eax
        addl    %edi, %eax
        ret

But we currently get:

        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        andl    $3, %edi
        call    __popcountdi2
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret

For aarch64 we currently get:

        and     x0, x0, 3
        fmov    d0, x0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        ret

vs:

        and     w1, w0, 1
        ubfx    x0, x0, 1, 1
        add     w0, w0, w1
        ret

The second sequence is much cheaper, since it avoids moving the value between the general-purpose and SIMD register files.
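
For reference, a minimal hand-expanded sketch of the transformation the summary asks for (the function name f_expanded is just illustrative, not part of the report); this is the source-level form that lowers to the cheap two-instruction-per-bit sequences shown above:

/* Hand-expanded equivalent of __builtin_popcount(y & 3):
   count bit 0 and bit 1 separately and add them, which needs
   no popcount instruction or libcall.  */
int f_expanded(unsigned y)
{
  return (y & 1) + ((y >> 1) & 1);
}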