https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122014

            Bug ID: 122014
           Summary: (AArch64) Optimize 8-bit and 16-bit popcount as
                    special cases
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: Explorer09 at gmail dot com
  Target Milestone: ---

GCC for AArch64 can use the CNT instruction for popcount. However, for 8-bit
or 16-bit integers, the code GCC emits is not the shortest possible.

This is a feature request: implement 8-bit and 16-bit popcount operations as
special cases.

Below is my implementation using intrinsics, compared with GCC's builtin.
In particular, there's no need to bitwise-AND the value if the input is 8 bits
or 16 bits.

The implementation using intrinsics should also work with ARMv7-A+NEON, but I'm
filing this report for AArch64 only, as it seems that GCC doesn't yet use NEON
for popcount on ARMv7-A.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
unsigned int popcount_8(uint8_t x) {
    // Set all lanes at once so that the compiler doesn't need to mask
    // out the upper bits.
    uint8x8_t v = vdup_n_u8(x);
    v = vcnt_u8(v);
    return vget_lane_u8(v, 0);
}
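unsigned int popcount_16(uint16_t x) {
    // Broadcast x to every 16-bit lane, again avoiding a mask.
    uint16x4_t v_h = vdup_n_u16(x);
    // Count the set bits in each byte lane.
    uint8x8_t v_b = vcnt_u8(vreinterpret_u8_u16(v_h));
    // Pairwise-add adjacent byte counts into 16-bit lanes.
    v_h = vpaddl_u8(v_b);
    return vget_lane_u16(v_h, 0);
}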
#endif
unsigned int popcount_8_b(uint8_t x) {
    return (unsigned int)__builtin_popcountg(x);
}
unsigned int popcount_16_b(uint16_t x) {
    return (unsigned int)__builtin_popcountg(x);
}
```

(Tested in Compiler Explorer)
ARM64 GCC 15.2.0 with `-Os` option:

```assembly
popcount_8:
        dup     v31.8b, w0
        cnt     v31.8b, v31.8b
        umov    w0, v31.b[0]
        ret
popcount_16:
        dup     v31.4h, w0
        cnt     v31.8b, v31.8b
        uaddlp  v31.4h, v31.8b
        umov    w0, v31.h[0]
        ret
popcount_8_b:
        and     w0, w0, 255
        fmov    d31, x0
        cnt     v31.8b, v31.8b
        smov    w0, v31.b[0]
        ret
popcount_16_b:
        and     x0, x0, 65535
        fmov    d31, x0
        cnt     v31.8b, v31.8b
        addv    b31, v31.8b
        fmov    w0, s31
        ret
```

armv8-a clang 21.1.0 with `-Os` option:

```assembly
popcount_8:
        fmov    s0, w0
        cnt     v0.8b, v0.8b
        umov    w0, v0.b[0]
        ret
popcount_16:
        dup     v0.4h, w0
        cnt     v0.8b, v0.8b
        uaddlp  v0.4h, v0.8b
        umov    w0, v0.h[0]
        ret
```

(It looks like both the FMOV and DUP instructions work here; pick whichever
is cheaper.)
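
As a rough illustration of why the mask is unnecessary in the intrinsic
versions, here is a scalar sketch of what lane 0 ends up holding. The names
`cnt_byte`, `model_popcount_8` and `model_popcount_16` are made up for this
example and are not part of GCC or `arm_neon.h`; the `uint32_t w0` parameter
stands in for a W register whose upper bits may contain garbage. It only models
the CNT + lane 0 read and the CNT + UADDLP + lane 0 read, not the ADDV sequence
that the builtin currently produces (ADDV sums all eight lanes, so that
sequence does depend on the upper lanes being zero).

```c
#include <stdint.h>

/* Hypothetical scalar model of a single CNT byte lane:
   counts the set bits in one byte. */
static unsigned cnt_byte(uint8_t b) {
    unsigned n = 0;
    while (b) {
        b &= (uint8_t)(b - 1);  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Model of popcount_8: byte lane 0 holds bits 7:0 of w0 whether the
   register was moved with FMOV or broadcast with DUP, so the upper
   bits of w0 never reach the result. */
unsigned model_popcount_8(uint32_t w0) {
    return cnt_byte((uint8_t)w0);
}

/* Model of popcount_16: CNT fills byte lanes 0 and 1 from bits 15:0,
   and UADDLP adds exactly those two lanes into halfword lane 0. */
unsigned model_popcount_16(uint32_t w0) {
    return cnt_byte((uint8_t)w0) + cnt_byte((uint8_t)(w0 >> 8));
}
```

In this model, setting or clearing bits above bit 7 (or bit 15) of `w0` does
not change the result, which is the property that lets the AND be dropped.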

(I've also reported the issue in Clang:
https://github.com/llvm/llvm-project/issues/159552 )
