On Thu, Jan 17, 2013 at 2:44 AM, Wei Mi <w...@google.com> wrote:
> Hi,
>
> For x86, a shift instruction automatically masks the count to 5 bits in
> 32-bit mode and to 6 bits in 64-bit mode, so for the testcase below,
> buf_ << (-end & 63) can be optimized to buf_ << -end. But for the trunk
> compiler, one place in the testcase is not optimized.
>
> typedef unsigned long long uint64;
> typedef unsigned int uint32;
>
> class Decoder {
>  public:
>   Decoder() : k_minus_1_(0), buf_(0), bits_left_(0) {}
>   ~Decoder() {}
>
>   uint32 ExtractBits(uint64 end, uint64 start);
>   inline uint32 GetBits(int bits) {
>     uint32 val = ExtractBits(bits, 0);
>     buf_ >>= bits;
>     bits_left_ -= bits;
>     return val;
>   }
>
>   uint32 Get(uint32 bits);
>
>   uint32 k_minus_1_;
>   uint64 buf_;
>   unsigned long bits_left_;
> };
>
> uint32 Decoder::ExtractBits(uint64 end, uint64 start) {
>   return (buf_ << (-end & 63)) >> ((start - end) & 63);
> }
>
> uint32 Decoder::Get(uint32 bits) {
>   bits += k_minus_1_;
>   uint32 msbit = (bits > (k_minus_1_ + 1));
>   return GetBits(bits - msbit) | (msbit << (bits - 1));
> }
>
> The assembly generated by "g++ -O2 -S 1.C":
>
>         .file   "1.c"
>         .text
>         .align 2
>         .p2align 4,,15
>         .globl  _ZN7Decoder11ExtractBitsEyy
>         .type   _ZN7Decoder11ExtractBitsEyy, @function
> _ZN7Decoder11ExtractBitsEyy:
> .LFB7:
>         .cfi_startproc
>         movq    8(%rdi), %rax
>         movl    %esi, %ecx
>         negl    %ecx
>         salq    %cl, %rax        ===> Here the (-end & 63) is optimized away.
>         movl    %edx, %ecx
>         subl    %esi, %ecx
>         shrq    %cl, %rax
>         ret
>         .cfi_endproc
> .LFE7:
>         .size   _ZN7Decoder11ExtractBitsEyy, .-_ZN7Decoder11ExtractBitsEyy
>         .align 2
>         .p2align 4,,15
>         .globl  _ZN7Decoder3GetEj
>         .type   _ZN7Decoder3GetEj, @function
> _ZN7Decoder3GetEj:
> .LFB8:
>         .cfi_startproc
>         movl    (%rdi), %eax
>         movq    8(%rdi), %r8
>         addl    %eax, %esi
>         addl    $1, %eax
>         movq    %r8, %r10
>         cmpl    %esi, %eax
>         movl    %esi, %ecx
>         setb    %al
>         movzbl  %al, %eax
>         subl    %eax, %ecx
>         movl    %ecx, %edx
>         shrq    %cl, %r10
>         movslq  %ecx, %r9
>         negl    %edx
>         movq    %r10, 8(%rdi)
>         subq    %r9, 16(%rdi)
>         andl    $63, %edx        ==> Inst A: the (-end & 63) is not optimized away.
>         movq    %r8, %rdi
>         movl    %edx, %ecx
>         salq    %cl, %rdi        ==> Inst B: uses the result of (-end & 63).
>         shrq    %cl, %rdi        ==> Inst C: uses the result of (-end & 63).
>         leal    -1(%rsi), %ecx
>         sall    %cl, %eax
>         orl     %edi, %eax
>         ret
>         .cfi_endproc
> .LFE8:
>
> In Decoder::Get(), the (-end & 63) in Inst A is not optimized away
> because the two (-end & 63) exprs are CSEd after ExtractBits() is
> inlined into GetBits() and then into Get(), leaving a single def that
> feeds multiple downstream uses, which the combine phase cannot
> optimize. This is an old limitation of combine. It is also the reason
> why (-end & 63) *is* optimized away in _ZN7Decoder11ExtractBitsEyy,
> where the def has a single use.
>
> To overcome this weakness of the combine phase for this testcase, I am
> wondering whether we can extend fwprop to solve the problem, because
> fwprop handles those "single def to multiple downstream uses" cases.
> The existing fwprop is restrictive: it only tries to propagate when the
> src of the def insn is a reg, subreg, or constant, and it only tries to
> propagate when simplification after the propagation can collapse the
> expr to a constant. Can we extend it to try to propagate more complex
> exprs from def to use (here, propagate the andl operation from Inst A
> into the shift-count operands of Inst B and Inst C), and let
> try_fwprop_subst in fwprop.c decide whether to keep the change by
> comparing costs?
>
> Or is there a better way to solve the problem? Appreciated a lot!
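(An illustrative aside for readers following the thread: the testcase
relies on the unsigned identity (-end) & 63 == 64 - end for end in
[1, 63], which makes the shift pair in ExtractBits extract exactly bits
[start, end) of buf_. The standalone check below is a sketch written
for this annotation; the extract() helper is hypothetical and not code
from the original report.)

    // Sketch (hypothetical helper, not from the original post): checks
    // that (buf << (-end & 63)) >> ((start - end) & 63) yields bits
    // [start, end) of buf for every in-range count, using the unsigned
    // identities (-end) & 63 == 64 - end and
    // (start - end) & 63 == 64 - (end - start).
    #include <cstdint>
    #include <cstdio>

    static uint64_t extract(uint64_t buf, uint64_t end, uint64_t start) {
      return (buf << (-end & 63)) >> ((start - end) & 63);
    }

    int main() {
      const uint64_t buf = 0x0123456789abcdefULL;
      for (uint64_t end = 1; end < 64; ++end) {
        for (uint64_t start = 0; start < end; ++start) {
          // Reference: shift the field down, then keep (end - start) bits.
          uint64_t ref = (buf >> start) & (~0ULL >> (64 - (end - start)));
          if (extract(buf, end, start) != ref) {
            printf("mismatch at end=%llu start=%llu\n",
                   (unsigned long long)end, (unsigned long long)start);
            return 1;
          }
        }
      }
      puts("extract() == bits [start, end) for every in-range count");
      return 0;
    }

The report's point is that on x86-64 the SAL/SHR hardware reads only
the low 6 bits of %cl anyway, so the explicit andl $63 is redundant
whenever its result feeds nothing but shift counts.)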
Combine / simplify-rtx should recognize this at the RTL level for
SHIFT_COUNT_TRUNCATED targets.

Richard.

> Thanks,
> Wei.
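(For context on the macro Richard mentions: SHIFT_COUNT_TRUNCATED is a
GCC target macro which, when nonzero, asserts that the machine's shift
instructions use only the low-order bits of the count register. That
licenses rewriting the RTL (ashift X (and C 63)) to (ashift X C) in
64-bit modes. The sketch below is a toy model of that rewrite written
for this annotation, with made-up types and names; it is deliberately
not GCC's simplify-rtx code.)

    // Toy model (made-up types; not GCC source) of the rewrite: on a
    // target whose shifter truncates the count, (ashift X (and C 63))
    // behaves identically with the AND removed, because the hardware
    // applies the same mask to C anyway.
    #include <cstdint>
    #include <cstdio>
    #include <memory>

    struct Expr;
    using ExprP = std::shared_ptr<Expr>;
    enum class Op { Reg, Const, And, Ashift };
    struct Expr {
      Op op;
      uint64_t val;      // payload for Op::Const
      const char *name;  // payload for Op::Reg
      ExprP a, b;        // operands for And / Ashift
    };

    static ExprP mk(Op op, ExprP a = nullptr, ExprP b = nullptr,
                    uint64_t val = 0, const char *name = "") {
      return std::make_shared<Expr>(Expr{op, val, name, a, b});
    }

    // Drop the AND around a shift count when its mask is exactly
    // bitsize - 1 and the target truncates shift counts in hardware.
    static ExprP simplify_shift(const ExprP &e, unsigned bitsize,
                                bool shift_count_truncated) {
      if (shift_count_truncated && e->op == Op::Ashift &&
          e->b->op == Op::And && e->b->b->op == Op::Const &&
          e->b->b->val == bitsize - 1)
        return mk(Op::Ashift, e->a, e->b->a);
      return e;
    }

    static void print(const ExprP &e) {
      switch (e->op) {
      case Op::Reg:    printf("%s", e->name); break;
      case Op::Const:  printf("%llu", (unsigned long long)e->val); break;
      case Op::And:    printf("(and "); print(e->a); printf(" ");
                       print(e->b); printf(")"); break;
      case Op::Ashift: printf("(ashift "); print(e->a); printf(" ");
                       print(e->b); printf(")"); break;
      }
    }

    int main() {
      // The shape produced by buf_ << (-end & 63): a negated count,
      // modeled as a plain register, masked with 63 and fed to a shift.
      ExprP buf = mk(Op::Reg, nullptr, nullptr, 0, "buf");
      ExprP cnt = mk(Op::Reg, nullptr, nullptr, 0, "neg_end");
      ExprP e = mk(Op::Ashift, buf,
                   mk(Op::And, cnt, mk(Op::Const, nullptr, nullptr, 63)));
      print(e);
      printf("  ->  ");
      print(simplify_shift(e, 64, /*shift_count_truncated=*/true));
      printf("\n");
      return 0;
    }

Running it prints (ashift buf (and neg_end 63)) -> (ashift buf neg_end):
the masking operation of Inst A disappears once every consumer of the
masked value is a shift count, which is the transformation combine
already performs when the def has a single use.)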