https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122014
Bug ID: 122014
Summary: (AArch64) Optimize 8-bit and 16-bit popcount as special cases
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: Explorer09 at gmail dot com
Target Milestone: ---

GCC for AArch64 supports using the CNT instruction for popcount. However, for popcount of 8-bit or 16-bit integers, the code GCC emits is not the shortest possible.

This is a feature request: implement 8-bit and 16-bit popcount operations as special cases.

Below I show my implementation using intrinsics, and a comparison with GCC's builtin. In particular, there's no need to bitwise-AND the value if the input is 8 bits or 16 bits.

The implementation using intrinsics should also work with ARMv7-A+NEON, but I'm filing this report for AArch64 only, as it seems that GCC doesn't yet support popcount using NEON there.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

unsigned int popcount_8(uint8_t x) {
  // Set all lanes at once so that the compiler doesn't need to mask
  // out the upper bits.
  uint8x8_t v = vdup_n_u8(x);
  v = vcnt_u8(v);
  return vget_lane_u8(v, 0);
}

unsigned int popcount_16(uint16_t x) {
  uint16x4_t v_h = vdup_n_u16(x);
  uint8x8_t v_b = vcnt_u8(vreinterpret_u8_u16(v_h));
  v_h = vpaddl_u8(v_b);
  return vget_lane_u16(v_h, 0);
}
#endif

unsigned int popcount_8_b(uint8_t x) {
  return (unsigned int)__builtin_popcountg(x);
}

unsigned int popcount_16_b(uint16_t x) {
  return (unsigned int)__builtin_popcountg(x);
}
```

(Tested in Compiler Explorer)

ARM64 GCC 15.2.0 with `-Os` option:

```assembly
popcount_8:
        dup     v31.8b, w0
        cnt     v31.8b, v31.8b
        umov    w0, v31.b[0]
        ret
popcount_16:
        dup     v31.4h, w0
        cnt     v31.8b, v31.8b
        uaddlp  v31.4h, v31.8b
        umov    w0, v31.h[0]
        ret
popcount_8_b:
        and     w0, w0, 255
        fmov    d31, x0
        cnt     v31.8b, v31.8b
        smov    w0, v31.b[0]
        ret
popcount_16_b:
        and     x0, x0, 65535
        fmov    d31, x0
        cnt     v31.8b, v31.8b
        addv    b31, v31.8b
        fmov    w0, s31
        ret
```

armv8-a clang 21.1.0 with `-Os` option:

```assembly
popcount_8:
        fmov    s0, w0
        cnt     v0.8b, v0.8b
        umov    w0, v0.b[0]
        ret
popcount_16:
        dup     v0.4h, w0
        cnt     v0.8b, v0.8b
        uaddlp  v0.4h, v0.8b
        umov    w0, v0.h[0]
        ret
```

(It looks like both the FMOV and DUP instructions work here; pick whichever is cheaper.)

(I've also reported the issue in Clang: https://github.com/llvm/llvm-project/issues/159552 )
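
For anyone who wants to check the "no need to bitwise-AND" claim without AArch64 hardware, here is a minimal portable sketch. It is only a model, not what GCC or Clang does internally: `cnt_8b` emulates the per-byte CNT with plain 64-bit arithmetic, and the names `cnt_8b`, `popcount_8_lane0`, `popcount_16_lane0`, `popcount_ref` and `check_unmasked` are made up for this illustration. The check puts all-ones garbage above the input bits and confirms that reading only lane 0 (or summing lanes 0 and 1) still gives the right count.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for CNT Vd.8B, Vn.8B: population count of each byte
 * lane of a 64-bit register image (SWAR; no carries cross byte lanes). */
static uint64_t cnt_8b(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    return (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
}

/* Models UMOV Wd, Vn.B[0] after CNT: only byte lane 0 is read. */
static unsigned popcount_8_lane0(uint64_t reg)
{
    return (unsigned)(cnt_8b(reg) & 0xFF);
}

/* Models UADDLP + UMOV Wd, Vn.H[0] after CNT: byte lanes 0 and 1 are summed. */
static unsigned popcount_16_lane0(uint64_t reg)
{
    uint64_t c = cnt_8b(reg);
    return (unsigned)((c & 0xFF) + ((c >> 8) & 0xFF));
}

/* Bit-by-bit reference count. */
static unsigned popcount_ref(uint32_t x)
{
    unsigned n = 0;
    for (; x != 0; x >>= 1)
        n += x & 1u;
    return n;
}

/* Exhaustive check with all-ones garbage above the input bits. */
static void check_unmasked(void)
{
    for (uint32_t x = 0; x <= 0xFFFFu; x++) {
        uint64_t dirty16 = 0xFFFFFFFFFFFF0000ULL | x;
        assert(popcount_16_lane0(dirty16) == popcount_ref(x));
        if (x <= 0xFFu) {
            uint64_t dirty8 = 0xFFFFFFFFFFFFFF00ULL | x;
            assert(popcount_8_lane0(dirty8) == popcount_ref(x));
        }
    }
}
```

Note that this only applies to the lane-0 forms. The ADDV in GCC's current `popcount_16_b` output sums all eight byte lanes, so dropping the AND there would also require switching to something like the UADDLP sequence.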