https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109907

--- Comment #24 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Georg-Johann Lay from comment #23)
> Thank you so much for looking into this.
> 
> For the test case from comment #21 though, the problem is somewhere in tree
> optimizations.
> 
> > unsigned char lfsr32_mpp_ge0 (unsigned long number)
> > {
> >   unsigned char b = 0;
> >   if (number >= 0) b--;
> >   if (number & (1UL << 29)) b++;
> >   if (number & (1UL << 13)) b++;
> > 
> >   return b;
> > }
> 
> The -fdump-tree-optimized dump reads:
> 
> ;; Function lfsr32_mpp_ge0 (lfsr32_mpp_ge0, funcdef_no=0, decl_uid=1880,
> cgraph_uid=1, symbol_order=0)
> 
> unsigned char lfsr32_mpp_ge0 (long unsigned int number)
> {
>   unsigned char b;
>   long unsigned int _1;
>   long unsigned int _2;
>   _Bool _3;
>   unsigned char _8;
>   _Bool _9;
>   unsigned char _10;
>   unsigned char _11;
> 
>   <bb 2> [local count: 1073741824]:
>   _1 = number_5(D) & 536870912;
>   _2 = number_5(D) & 8192;
>   if (_2 != 0)
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 3>; [50.00%]
> 
>   <bb 3> [local count: 536870912]:
>   _9 = _1 == 0;
>   _10 = (unsigned char) _9;
>   _11 = -_10;
>   goto <bb 5>; [100.00%]
> 
>   <bb 4> [local count: 536870913]:
>   _3 = _1 != 0;
>   _8 = (unsigned char) _3;
> 
>   <bb 5> [local count: 1073741824]:
>   # b_4 = PHI <_11(3), _8(4)>
>   return b_4;
> }
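
For readers less used to GIMPLE, that dump is equivalent to the following C
(an illustrative rendering; the function name is mine):
```
/* Illustrative C rendering of the -fdump-tree-optimized dump above.
   Note that number >= 0 is always true for an unsigned operand, so the
   first if in the source is folded away and b starts out decremented.  */
unsigned char lfsr32_mpp_ge0_as_c (unsigned long number)
{
  unsigned long bit29 = number & (1UL << 29);            /* _1 */
  if (number & (1UL << 13))                              /* bb 4 */
    return (unsigned char) (bit29 != 0);                 /* _8 */
  return (unsigned char) -(unsigned char) (bit29 == 0);  /* _11, bb 3 */
}
```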

Oh yes, this is a case that my
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619068.html patch actually
solves.
It should be able to detect that _1 has a nonzero-bits mask with just one bit
set, and expand _3 using single_bit_test.
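
In C terms, the equivalence it exploits looks like this (an illustrative
sketch, not the patch itself; the function names are made up):
```
/* Since _1 = number & (1UL << 29) has only bit 29 in its nonzero-bits
   mask, comparing it against zero is the same as testing that one bit.  */
unsigned char via_compare (unsigned long number)
{
  unsigned long masked = number & (1UL << 29); /* only bit 29 can be set */
  return masked != 0;                          /* full-width compare */
}

unsigned char via_bit_test (unsigned long number)
{
  return (number >> 29) & 1;                   /* same value, one bit test */
}
```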

For example, we get:
;; Generating RTL for gimple basic block 3

;; _11 = -_10;

(insn 10 9 11 (set (reg:QI 51)
        (zero_extract:QI (subreg:QI (reg:SI 43 [ _1 ]) 3)
            (const_int 1 [0x1])
            (const_int 5 [0x5]))) "t21.c":5:6 -1
     (nil))

(insn 11 10 12 (set (reg:QI 53)
        (const_int 1 [0x1])) "t21.c":5:6 -1
     (nil))

(insn 12 11 13 (set (reg:QI 52)
        (xor:QI (reg:QI 51)
            (reg:QI 53))) "t21.c":5:6 -1
     (nil))

(insn 13 12 0 (set (reg/v:QI 48 [ <retval> ])
        (neg:QI (reg:QI 52))) "t21.c":5:6 -1
     (nil))


Which is exactly what you want, right?
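
Read back into C, insns 10 through 13 compute the following (a sketch; the
function name is mine):
```
/* What insns 10-13 above compute:
   insn 10: zero_extract of bit 5 of QI subreg 3, i.e. bit 29 of _1
   insns 11-12: xor with 1 inverts it, giving _9 = (_1 == 0)
   insn 13: negation, giving _11 = -_10                             */
unsigned char bb3_as_c (unsigned long masked /* _1 */)
{
  unsigned char bit29 = (masked >> 29) & 1; /* insn 10: zero_extract */
  unsigned char eq0 = bit29 ^ 1;            /* insns 11-12: _9 = (_1 == 0) */
  return (unsigned char) -eq0;              /* insn 13: _11 = -_10 */
}
```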

Overall we get:
```
lfsr32_mpp_ge0:
        push r16
        push r17
/* prologue: function */
/* frame size = 0 */
/* stack size = 2 */
.L__stack_usage = 2
        mov r16,r22
        mov r17,r23
        mov r18,r24
        mov r19,r25
        mov r27,r19
        mov r26,r18
        mov r25,r17
        mov r24,r16
        clr r24
        clr r25
        clr r26
        andi r27,32
        bst r27,5
        clr r24
        bld r24,0
        sbrs r17,5
        subi r24,lo8(-(-1))
.L1:
/* epilogue start */
        pop r17
        pop r16
        ret
```

Which is much better than it was before (it could still be improved further,
but that is for a different time).

To finish this patch up, I am supposed to do some cost modelling and such,
which I might get to this weekend.
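
Roughly, the idea would be to cost the two candidate expansion sequences and
only use the bit-test form when it is no more expensive. A minimal sketch,
assuming GCC's internal environment; seq_cost is GCC's real helper for
costing an insn sequence, but this function and how it would be wired into
the expander are hypothetical:
```
/* Hypothetical cost check (assumes GCC's internal rtl.h environment).
   Returns true when the single-bit-test sequence is no costlier than
   the compare-against-zero sequence.  */
static bool
prefer_single_bit_test_p (rtx_insn *bit_test_seq, rtx_insn *compare_seq,
                          bool speed_p)
{
  return seq_cost (bit_test_seq, speed_p) <= seq_cost (compare_seq, speed_p);
}
```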


Even for aarch64 we do slightly better. Before the patch:
```
lfsr32_mpp_ge0:
        and     x1, x0, 536870912
        tbnz    x0, 13, .L2
        cmp     x1, 0
        csetm   w0, eq
        and     w0, w0, 255
        ret
.L2:
        cmp     x1, 0
        cset    w0, ne
        ret
```

And after the patch:
```
lfsr32_mpp_ge0:
        and     x1, x0, 536870912
        tbnz    x0, 13, .L2
        ubfx    x0, x1, 29, 1
        eor     w0, w0, 1
        neg     w0, w0
        and     w0, w0, 255
        ret
.L2:
        ubfx    x0, x1, 29, 1
        and     w0, w0, 255
        ret
```

Though if there were a way to remove the first and instructions, that would be
best (and the last and; that one is still there because of fuzziness with
combine still trying to do ne there).

Note that even though the code does increase in size, the cost of cset on some
processors is 2 cycles rather than 1, so avoiding it can still be a win.
