On Thu, Jan 17, 2013 at 2:44 AM, Wei Mi <w...@google.com> wrote:
> Hi,
>
> For x86, a shift instruction automatically masks the count to 5 bits in
> 32-bit mode and to 6 bits in 64-bit mode, so for the testcase below,
> buf_ << (-end & 63) can be optimized to buf_ << -end. But for the trunk
> compiler, one place in the testcase is not optimized.
>
> typedef unsigned long long uint64;
> typedef unsigned int uint32;
>
> class Decoder {
>  public:
>   Decoder() : k_minus_1_(0), buf_(0), bits_left_(0) {}
>   ~Decoder() {}
>
>   uint32 ExtractBits(uint64 end, uint64 start);
>   inline uint32 GetBits(int bits) {
>     uint32 val = ExtractBits(bits, 0);
>     buf_ >>= bits;
>     bits_left_ -= bits;
>     return val;
>   }
>
>   uint32 Get(uint32 bits);
>
>   uint32 k_minus_1_;
>   uint64 buf_;
>   unsigned long bits_left_;
> };
>
> uint32 Decoder::ExtractBits(uint64 end, uint64 start) {
>   return (buf_ << (-end & 63)) >> ((start - end) & 63);
> }
>
> uint32 Decoder::Get(uint32 bits) {
>   bits += k_minus_1_;
>   uint32 msbit = (bits > (k_minus_1_ + 1));
>   return GetBits(bits - msbit) | (msbit << (bits - 1));
> }
>
> The assembly generated by "g++ -O2 -S 1.C":
>
>         .file   "1.c"
>         .text
>         .align 2
>         .p2align 4,,15
>         .globl  _ZN7Decoder11ExtractBitsEyy
>         .type   _ZN7Decoder11ExtractBitsEyy, @function
> _ZN7Decoder11ExtractBitsEyy:
> .LFB7:
>         .cfi_startproc
>         movq    8(%rdi), %rax
>         movl    %esi, %ecx
>         negl    %ecx
>         salq    %cl, %rax        ===> Here the (-end & 63) is optimized away.
>         movl    %edx, %ecx
>         subl    %esi, %ecx
>         shrq    %cl, %rax
>         ret
>         .cfi_endproc
> .LFE7:
>         .size   _ZN7Decoder11ExtractBitsEyy, .-_ZN7Decoder11ExtractBitsEyy
>         .align 2
>         .p2align 4,,15
>         .globl  _ZN7Decoder3GetEj
>         .type   _ZN7Decoder3GetEj, @function
> _ZN7Decoder3GetEj:
> .LFB8:
>         .cfi_startproc
>         movl    (%rdi), %eax
>         movq    8(%rdi), %r8
>         addl    %eax, %esi
>         addl    $1, %eax
>         movq    %r8, %r10
>         cmpl    %esi, %eax
>         movl    %esi, %ecx
>         setb    %al
>         movzbl  %al, %eax
>         subl    %eax, %ecx
>         movl    %ecx, %edx
>         shrq    %cl, %r10
>         movslq  %ecx, %r9
>         negl    %edx
>         movq    %r10, 8(%rdi)
>         subq    %r9, 16(%rdi)
>         andl    $63, %edx        ==> Inst A: the (-end & 63) is not optimized away.
>         movq    %r8, %rdi
>         movl    %edx, %ecx
>         salq    %cl, %rdi        ==> Inst B: uses the result of (-end & 63).
>         shrq    %cl, %rdi        ==> Inst C: uses the result of (-end & 63).
>         leal    -1(%rsi), %ecx
>         sall    %cl, %eax
>         orl     %edi, %eax
>         ret
>         .cfi_endproc
> .LFE8:
>
> In Decoder::Get(), the (-end & 63) in Inst A is not optimized away
> because the two (-end & 63) exprs are CSEd after ExtractBits() is
> inlined into GetBits() and then into Get(), leaving a single def that
> feeds multiple downstream uses, which the combine phase cannot
> optimize. This is an old limitation of combine. It is also the reason
> why (-end & 63) *is* optimized away in _ZN7Decoder11ExtractBitsEyy,
> where the def has a single use.
>
> To overcome this weakness of the combine phase for this testcase, I am
> wondering whether we can extend fwprop to solve the problem, because
> fwprop handles those "single def to multiple downstream uses" cases.
> The existing fwprop is restrictive: it only tries to propagate when the
> src of the def insn is a reg, subreg, or constant, and it only tries to
> propagate when simplification after the propagation can collapse the
> expr to a constant. Can we extend it to try to propagate more complex
> exprs from def to use (here, propagate the andl operation from Inst A
> into the shift-count operands of Inst B and Inst C), and let
> try_fwprop_subst in fwprop.c decide whether to keep the change by
> comparing costs?
>
> Or is there a better way to solve the problem? Appreciated a lot!
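(An illustrative aside for readers following the thread: the testcase
relies on the unsigned identity (-end) & 63 == 64 - end for end in
[1, 63], which makes the shift pair in ExtractBits extract exactly bits
[start, end) of buf_. The standalone check below is a sketch written
for this annotation; the extract() helper is hypothetical and not code
from the original report.)

    // Sketch (hypothetical helper, not from the original post): checks
    // that (buf << (-end & 63)) >> ((start - end) & 63) yields bits
    // [start, end) of buf for every in-range count, using the unsigned
    // identities (-end) & 63 == 64 - end and
    // (start - end) & 63 == 64 - (end - start).
    #include <cstdint>
    #include <cstdio>

    static uint64_t extract(uint64_t buf, uint64_t end, uint64_t start) {
      return (buf << (-end & 63)) >> ((start - end) & 63);
    }

    int main() {
      const uint64_t buf = 0x0123456789abcdefULL;
      for (uint64_t end = 1; end < 64; ++end) {
        for (uint64_t start = 0; start < end; ++start) {
          // Reference: shift the field down, then keep (end - start) bits.
          uint64_t ref = (buf >> start) & (~0ULL >> (64 - (end - start)));
          if (extract(buf, end, start) != ref) {
            printf("mismatch at end=%llu start=%llu\n",
                   (unsigned long long)end, (unsigned long long)start);
            return 1;
          }
        }
      }
      puts("extract() == bits [start, end) for every in-range count");
      return 0;
    }

The report's point is that on x86-64 the SAL/SHR hardware reads only
the low 6 bits of %cl anyway, so the explicit andl $63 is redundant
whenever its result feeds nothing but shift counts.)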
Combine / simplify-rtx should recognize this at the RTL level for
SHIFT_COUNT_TRUNCATED targets.

Richard.

> Thanks,
> Wei.
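(For context on the macro Richard mentions: SHIFT_COUNT_TRUNCATED is a
GCC target macro which, when nonzero, asserts that the machine's shift
instructions use only the low-order bits of the count register. That
licenses rewriting the RTL (ashift X (and C 63)) to (ashift X C) in
64-bit modes. The sketch below is a toy model of that rewrite written
for this annotation, with made-up types and names; it is deliberately
not GCC's simplify-rtx code.)

    // Toy model (made-up types; not GCC source) of the rewrite: on a
    // target whose shifter truncates the count, (ashift X (and C 63))
    // behaves identically with the AND removed, because the hardware
    // applies the same mask to C anyway.
    #include <cstdint>
    #include <cstdio>
    #include <memory>

    struct Expr;
    using ExprP = std::shared_ptr<Expr>;
    enum class Op { Reg, Const, And, Ashift };
    struct Expr {
      Op op;
      uint64_t val;      // payload for Op::Const
      const char *name;  // payload for Op::Reg
      ExprP a, b;        // operands for And / Ashift
    };

    static ExprP mk(Op op, ExprP a = nullptr, ExprP b = nullptr,
                    uint64_t val = 0, const char *name = "") {
      return std::make_shared<Expr>(Expr{op, val, name, a, b});
    }

    // Drop the AND around a shift count when its mask is exactly
    // bitsize - 1 and the target truncates shift counts in hardware.
    static ExprP simplify_shift(const ExprP &e, unsigned bitsize,
                                bool shift_count_truncated) {
      if (shift_count_truncated && e->op == Op::Ashift &&
          e->b->op == Op::And && e->b->b->op == Op::Const &&
          e->b->b->val == bitsize - 1)
        return mk(Op::Ashift, e->a, e->b->a);
      return e;
    }

    static void print(const ExprP &e) {
      switch (e->op) {
      case Op::Reg:    printf("%s", e->name); break;
      case Op::Const:  printf("%llu", (unsigned long long)e->val); break;
      case Op::And:    printf("(and "); print(e->a); printf(" ");
                       print(e->b); printf(")"); break;
      case Op::Ashift: printf("(ashift "); print(e->a); printf(" ");
                       print(e->b); printf(")"); break;
      }
    }

    int main() {
      // The shape produced by buf_ << (-end & 63): a negated count,
      // modeled as a plain register, masked with 63 and fed to a shift.
      ExprP buf = mk(Op::Reg, nullptr, nullptr, 0, "buf");
      ExprP cnt = mk(Op::Reg, nullptr, nullptr, 0, "neg_end");
      ExprP e = mk(Op::Ashift, buf,
                   mk(Op::And, cnt, mk(Op::Const, nullptr, nullptr, 63)));
      print(e);
      printf("  ->  ");
      print(simplify_shift(e, 64, /*shift_count_truncated=*/true));
      printf("\n");
      return 0;
    }

Running it prints (ashift buf (and neg_end 63)) -> (ashift buf neg_end):
the masking operation of Inst A disappears once every consumer of the
masked value is a shift count, which is the transformation combine
already performs when the def has a single use.)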