https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102487

            Bug ID: 102487
           Summary: __builtin_popcount(y&3) is not optimized to
                    (y&1)+((y&2)>>1) when there is no popcount optab
                    (or it is expensive)
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*

Take:
int f(unsigned y)
{
  return __builtin_popcount(y&3);
}
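
That is, because y&3 keeps only the two low bits, the call can be
rewritten as plain bit arithmetic. A minimal C sketch of the intended
equivalent (f_expanded is an illustrative name, not generated code):

int f_expanded(unsigned y)
{
  /* Sum the two masked bits directly: bit 0 contributes (y & 1),
     bit 1 contributes ((y & 2) >> 1).  */
  return (y & 1) + ((y & 2) >> 1);
}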

On x86_64 (without the popcount optab enabled) this should be optimized to just:
        movl    %edi, %eax
        shrl    %edi
        andl    $1, %edi
        andl    $1, %eax
        addl    %edi, %eax
        ret

But we currently get:

        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        andl    $3, %edi
        call    __popcountdi2
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret

For aarch64 we currently get:

        and     x0, x0, 3
        fmov    d0, x0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        ret

vs:

        and     w1, w0, 1
        ubfx    x0, x0, 1, 1
        add     w0, w0, w1
        ret

The second sequence is much cheaper, since it avoids moving the value between
the general-purpose and SIMD register sets.
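
More generally, whenever the argument is known to have only a few
possibly-nonzero bits, popcount can be lowered to a sum of single-bit
tests. A hedged sketch of that generic expansion (popcount_masked is a
hypothetical helper for illustration, not existing GCC code):

/* Compute popcount of (y & mask); for a compile-time constant mask a
   compiler could fully unroll this loop into shifts, ands and adds.  */
static inline int popcount_masked(unsigned y, unsigned mask)
{
  int sum = 0;
  while (mask) {
    unsigned bit = mask & -mask;  /* isolate lowest set bit of mask */
    sum += (y & bit) != 0;        /* add 1 if that bit is set in y */
    mask &= mask - 1;             /* clear lowest set bit of mask */
  }
  return sum;
}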
