https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773

            Bug ID: 104773
           Summary: compare with 1 not merged with subtract 1
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*, arm-*-*

std::bit_ceil(x) involves if(x == 0 || x == 1) return 1;
and 1u << (32-clz(x-1)).
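(For example, with x = 5: clz(5-1) = clz(4) = 29 for a 32-bit type, so
1u << (32-29) = 8.)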

The compare of course compiles to an unsigned <= 1, which can be done with a
sub instead of cmp, producing the value we need as an input for the
leading-zero count.  But GCC does *not* do this.  (Neither does clang for
x86-64.)  I trimmed down the libstdc++ <bit> code into something I could
compile even when Godbolt doesn't have working headers for some ISAs:
https://godbolt.org/z/3EE7W5bna

// cut down from libstdc++ for the normal integer cases; compiles the same
#include <limits>

  template<typename _Tp>
    constexpr _Tp
    bit_ceil(_Tp __x) noexcept
    {
      constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
      if (__x == 0 || __x == 1)
        return 1;
      auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
      // using __promoted_type = decltype(__x << 1); ...  // removed check for x<<n widening the result
      return (_Tp)1u << __shift_exponent;
    }
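
For reference, a wrapper along these lines (hypothetical, but the roundup name
matches the label in the asm dumps below), plus some compile-time sanity checks
I'm adding here that are not part of the original:

  // Hypothetical instantiation matching the roundup() label in the asm dumps:
  unsigned roundup(unsigned x) { return bit_ceil(x); }

  // Compile-time sanity checks of the cut-down bit_ceil:
  static_assert(bit_ceil(0u) == 1);
  static_assert(bit_ceil(1u) == 1);
  static_assert(bit_ceil(2u) == 2);
  static_assert(bit_ceil(3u) == 4);
  static_assert(bit_ceil(5u) == 8);
  static_assert(bit_ceil(17u) == 32);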


For x86-64 with GCC trunk -O3 -march=ivybridge, we get this inefficient code:

roundup(unsigned int):
        mov     eax, 1
        cmp     edi, 1
        jbe     .L1
        sub     edi, 1        # could have just done a sub in the first place
        bsr     edi, edi      # correctly avoids a false dependency by *not* using ECX as the destination
        lea     ecx, [rdi+1]  # could have shifted 2<<n instead of 1<<(n+1)
        sal     eax, cl       # 3 uops; bts (1 uop) would be a more efficient way to materialize 1<<n
.L1:
        ret

Also, Ivybridge has no problem with DEC instead of SUB 1; I don't know why GCC
avoids DEC here but not for Haswell, for example.  (The Haswell tuning
pessimizes by using 32-lzcnt instead of lzcnt^31 or something, or just keeping
BSR, which performs identically to lzcnt on actual Haswell; lzcnt is only
faster on AMD.)
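(For nonzero inputs, bsr(x) = 31 - lzcnt(x) = lzcnt(x) ^ 31, since both results
fit in 0..31, which is why either form works together with the 2<<n trick.)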

But this bug report is just about sub/cmp combining, not how to materialize
1<<(n+1) or other stuff.  Better would be:

    sub  edi, 1
    jbe  .L1           # taken only for x == 0 or x == 1 (CF or ZF set by the sub)
    bsr  edi, edi
    xor  eax, eax
    inc  edi
    bts  eax, edi      # EAX |= 1<<EDI
    ret
.L1:
    mov  eax, 1
    ret
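
At the source level, the shape I'd want the optimizer to reach is roughly the
following (a hypothetical sketch I'm adding here, not code from the Godbolt
link; it also uses the 2<<n trick from the comment above):

  // Hypothetical source-level shape: one SUB produces both the flags for the
  // early-out test and the input for the leading-zero count.
  unsigned roundup_sub_first(unsigned x)
  {
      unsigned m = x - 1;          // sub edi, 1  (sets CF/ZF like cmp edi, 1)
      if (x <= 1)                  // ideally folded into the sub's flags: jbe
          return 1;
      // 31 - clz(m) is bsr(m); shifting 2 avoids the separate +1.
      return 2u << (31 - __builtin_clz(m));
  }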


Intel SnB-family can macro-fuse sub/jbe.  AMD can't, so the change is
break-even for front-end uops when the branch is not taken, and worse when it
is taken.  But it's still smaller code size.
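
To spell out why the sub can stand in for the cmp: jbe tests CF|ZF, and after
x-1 those flags encode exactly the unsigned predicate x <= 1.  A hypothetical
demo of the predicate equivalence (the function name is mine, not from the
report):

  bool below_or_equal_1(unsigned x)
  {
      unsigned m = x - 1;   // the same subtraction bit_ceil needs anyway
      bool cf = m > x;      // borrow out of x-1, i.e. x == 0
      bool zf = m == 0;     // x-1 == 0, i.e. x == 1
      return cf || zf;      // the jbe condition; identical to (x <= 1)
  }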

For ARM, clang finds a very clever way to combine it:

roundup(unsigned int):
        subs    r0, r0, #1
        clz     r0, r0
        rsb     r1, r0, #32         @ 32-clz
        mov     r0, #1
        lslhi   r0, r0, r1          @ using flags set by SUBS
        bx      lr                  @ 1<<(32-clz) or just 1

GCC, on the other hand, does much worse with -O3 -std=gnu++20 -mcpu=cortex-a53
-mthumb:

roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1
        bx      lr

I suspect we could do better by combining the cmp and addhi, and doing
`mov r0, #1` outside of predication.  (I think that's a separate bug; I'm
planning to report it separately.)  Then one `it` would be enough to cover
things, I think.

That would basically reduce it to clang's strategy, although the predication
of the clz and rsb is optional.
