[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-29 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #11 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-29 
18:51:51 UTC ---
Author: jakub
Date: Sun May 29 18:51:48 2011
New Revision: 174413

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=174413
Log:
PR rtl-optimization/49095
* config/i386/predicates.md (plusminuslogic_operator): New predicate.
* config/i386/i386.md: Add peepholes for mem {+,-,,|,^}= x; mem != 0.

* gcc.target/i386/pr49095.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr49095.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.md
trunk/gcc/config/i386/predicates.md
trunk/gcc/testsuite/ChangeLog


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-29 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

Jakub Jelinek jakub at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||FIXED

--- Comment #12 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-29 
18:52:43 UTC ---
Fixed.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-29 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #13 from Linus Torvalds torva...@linux-foundation.org 2011-05-29 
18:56:44 UTC ---
Thanks guys.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

Jakub Jelinek jakub at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 AssignedTo|unassigned at gcc dot   |jakub at gcc dot gnu.org
   |gnu.org |

--- Comment #4 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 
10:35:52 UTC ---
Created attachment 24366
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=24366
gcc47-pr49095.patch

Untested patch which solves this (for some cases) using peephole2.

The reason combiner can't do this is that there are no LOG_LINKS in between the
comparison and memory store, those two insns are independent and thus don't
ever show up together in the same try_combine call (together with their
dependencies).  And we generate:
movq(%rsi), %rax
subq$1, %rax
testq%rax, %rax
movq%rax, (%rsi)
je.L4
instead of
movq(%rsi), %rax
subq$1, %rax
movq%rax, (%rsi)
je.L4
because in case of multiple uses, LOG_LINKS are on the first use, and the
memory store is before the test during combine (only later scheduling changes
it).
Thus the test has no LOG_LINKS at all and isn't being attempted to be merged
together with the subtraction.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

Jakub Jelinek jakub at gcc dot gnu.org changed:

   What|Removed |Added

  Attachment #24366|0   |1
is obsolete||

--- Comment #5 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 
11:58:37 UTC ---
Created attachment 24368
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=24368
gcc47-pr49095.patch

Improved patch with two new peephole2 patterns, which in addition to the
earlier:
reg0 = mem1; reg0 op= nonmem2; mem1 = reg0; if (reg0 != 0)
optimizes also:
reg0 op= mem1; mem1 = reg0; if (reg0 != 0)
and variant of the first, where all the operations are in QI or HImode, except
for op= which is in SImode.

Again, the patch has been tested just on this testcase so far.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #6 from Linus Torvalds torva...@linux-foundation.org 2011-05-27 
14:15:25 UTC ---
Jakub - the patch looks sane, although I don't currently have a gcc build
environment to actually test it with, and there is no way I'm going to claim
that I follow all the matches and understand why you have that third pattern
with SWI12 (but I'm guessing it's because the op and the test are being done in
the explicit cast to integer mode).

One thing I wanted to check, though: I hope everybody is aware that

   op $x,mem

is not identically the same as

   mov mem,reg
   op $x, reg
   mov reg,mem
   test reg

WRT the carry flag. The test version will always clear the carry flag, while
the op version will obviously set it according to the op. For the logical
ops, this isn't a problem (they also clear carry), but for add and sub this
transformation *will* result in a difference in the C flag.

Now, carry is meaningless when talking about compare with zero, so this
should all be trivially ok. But I bring it up because it is possible that gcc
perhaps still uses the unsigned compare versions with zero.

In particular, the thing to look out for would be code like

  unsigned int *a;

  if (--*a  0)
do_something();

and if the comparison uses jg (unsigned greater than) for the comparison,
it will get the wrong end result.

Now, unsigned compares against zero are always expressible as ne or eq, so
this is not a fundamental problem. In fact, my trivial testing with a few cases
seems to imply that gcc already does that conversion to inequality, and you'd
obviously have exactly the same issue with eliding the test instruction for
the cases you already handle (when things are in registers).

So I think everything is fine - but I wanted to mention it explicitly in case
it makes you go ooh, yes, there are cases when this is an issue


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #7 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 
14:47:08 UTC ---
(In reply to comment #6)
 Jakub - the patch looks sane, although I don't currently have a gcc build
 environment to actually test it with, and there is no way I'm going to claim
 that I follow all the matches and understand why you have that third pattern
 with SWI12 (but I'm guessing it's because the op and the test are being done 
 in
 the explicit cast to integer mode).

E.g. in the testcase from the patch, in hcharplus routine peephole2 sees:
(insn 27 6 28 2 (set (reg:QI 0 ax [orig:62 D.3491 ] [62])
(mem/c:QI (reg/v/f:DI 3 bx [orig:64 x ] [64]) [0 *x_1(D)+0 S1 A8]))
pr49095.c:63 66 {*movqi_internal}
 (nil))

(insn 28 27 8 2 (parallel [
(set (reg:SI 0 ax [orig:62 D.3491 ] [62])
(plus:SI (reg:SI 0 ax [orig:62 D.3491 ] [62])
(const_int 24 [0x18])))
(clobber (reg:CC 17 flags))
]) pr49095.c:63 249 {*addsi_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))

(insn 8 28 9 2 (set (mem:QI (reg/v/f:DI 3 bx [orig:64 x ] [64]) [0 *x_1(D)+0 S1
A8])
(reg:QI 0 ax [orig:62 D.3491 ] [62])) pr49095.c:63 66 {*movqi_internal}
 (nil))

(insn 9 8 10 2 (set (reg:CCZ 17 flags)
(compare:CCZ (reg:QI 0 ax [orig:62 D.3491 ] [62])
(const_int 0 [0]))) pr49095.c:63 0 {*cmpqi_ccno_1}
 (expr_list:REG_DEAD (reg:QI 0 ax [orig:62 D.3491 ] [62])
(nil)))

It used to be normal QImode addition addqi_1_lea, but it got later on split
to make the code shorter:

;; Avoid redundant prefixes by splitting HImode arithmetic to SImode.

(define_split
  [(set (match_operand 0 register_operand )
(match_operator 3 promotable_binary_operator
   [(match_operand 1 register_operand )
(match_operand 2 aligned_operand )]))
   (clobber (reg:CC FLAGS_REG))]
  ! TARGET_PARTIAL_REG_STALL  reload_completed
((GET_MODE (operands[0]) == HImode
 ((optimize_function_for_speed_p (cfun)  !TARGET_FAST_PREFIX)
/* ??? next two lines just !satisfies_constraint_K (...) */
|| !CONST_INT_P (operands[2])
|| satisfies_constraint_K (operands[2])))
   || (GET_MODE (operands[0]) == QImode   
(TARGET_PROMOTE_QImode || optimize_function_for_size_p (cfun
  [(parallel [(set (match_dup 0)  
   (match_op_dup 3 [(match_dup 1) (match_dup 2)]))
  (clobber (reg:CC FLAGS_REG))])] 
  operands[0] = gen_lowpart (SImode, operands[0]);   
   operands[1] = gen_lowpart (SImode, operands[1]);   
   if (GET_CODE (operands[3]) != ASHIFT)
 operands[2] = gen_lowpart (SImode, operands[2]);
   PUT_MODE (operands[3], SImode);)

 One thing I wanted to check, though: I hope everybody is aware that
 
op $x,mem
 
 is not identically the same as
 
mov mem,reg
op $x, reg
mov reg,mem
test reg
 
 WRT the carry flag. The test version will always clear the carry flag, while
 the op version will obviously set it according to the op. For the logical
 ops, this isn't a problem (they also clear carry), but for add and sub 
 this
 transformation *will* result in a difference in the C flag.
 
 Now, carry is meaningless when talking about compare with zero, so this
 should all be trivially ok. But I bring it up because it is possible that gcc
 perhaps still uses the unsigned compare versions with zero.
 
 In particular, the thing to look out for would be code like
 
   unsigned int *a;
 
   if (--*a  0)
 do_something();
 
 and if the comparison uses jg (unsigned greater than) for the comparison,
 it will get the wrong end result.
 
 Now, unsigned compares against zero are always expressible as ne or eq, so
 this is not a fundamental problem. In fact, my trivial testing with a few 
 cases
 seems to imply that gcc already does that conversion to inequality, and you'd
 obviously have exactly the same issue with eliding the test instruction for
 the cases you already handle (when things are in registers).

That ought to be handled by the
ix86_match_ccmode tests in the peepholes, using CCGOCmode for PLUS/MINUS and
CCNOmode for AND/IOR/XOR.  I've just copied what the actual patterns these
peepholes will turn into use.  The documentation about those modes says:
   Add CCNO to indicate comparisons against zero that requires
   Overflow flag to be unset.  Sign bit test is used instead and
   thus can be used to form ab0 type of tests.

   Add CCGOC to indicate comparisons against zero that allows
   unspecified garbage in the Carry and Overflow flag. This
   mode is used to simulate comparisons of (a-b) and (a+b)
   against zero using sub/cmp/add operations.
and ix86_match_ccmode should check if the test type in which the flags register
is tested uses only the flags that will be set correctly by the operation.  In
your testcase the comparison was done in 

[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #8 from Linus Torvalds torva...@linux-foundation.org 2011-05-27 
15:32:21 UTC ---
(In reply to comment #7)

 BTW, the patch bootstrapped/regtested on both x86_64-linux and i686-linux, I'm
 just running second set of bootstrap without the patch to be able to compare
 how much the patch changed code on gcc itself.

I'd love to hear how well it works - I'm in the middle of a busy merge window
and about to leave for a trip to Japan, otherwise I'd just try to set up a gcc
build myself right now.

(Actually, I have a copy of a git tree, but that one fails with

  /usr/bin/ld: ../libiberty/libiberty.a(argv.o): relocation R_X86_64_32S
against `_sch_istable' can not be used when making a shared object; recompile
with -fPIC

and due to the merge window I don't really have time to try to figure it out)

Thanks,

 Linus


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #9 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 
16:33:33 UTC ---
.text sizes before/after the patch (in each case on identical sources, for
cc1plus I've reverted the patch afterwards and did make -j64 cc1plus in gcc/
subdir), the size changes aren't that big, but perhaps other projects use it
more often and/or something is hidden due to alignment.

32-bit before32-bit after64-bit before64-bit after
libstdc++.so.60x717080x716e80x67ee60x67ec6
libgcj.so.120xa3ec580xa3eb980x90cd680x90cce8
cc1plus0xe8a29c0xe8a25c0xdccf980xdccf08

In 64-bit cc1plus it only made a difference in:
c-decl.o
cfg.o
dbxout.o
ebitmap.o
final.o
ira-color.o
jump.o
regcprop.o
reg-stack.o
reload.o
tree-ssa-dom.o
tree-ssa-loop-ivopts.o

Anyway, I'll post the patch now and see what Uros says about it.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-27 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #10 from Linus Torvalds torva...@linux-foundation.org 2011-05-27 
16:48:52 UTC ---
(In reply to comment #9)
 
 32-bit before32-bit after64-bit before64-bit after
 libstdc++.so.60x717080x716e80x67ee60x67ec6
 libgcj.so.120xa3ec580xa3eb980x90cd680x90cce8
 cc1plus0xe8a29c0xe8a25c0xdccf980xdccf08

Ok, that's much less noticeable than I was hoping for.

That said, even for the kernel, the only reason I noticed this problem was not
because I've seen a lot of them per se, but because the pattern showed up in a
very hot function. In fact, it's the same __rcu_read_unlock() function that I
made the otherwise unrelated bugzilla entry for inlining: 

  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

so it's quite possible that we don't have that many of them in the kernel
either.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-21 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Target||x86_64-*-*, i?86-*-*
 Status|UNCONFIRMED |NEW
   Keywords||missed-optimization
   Last reconfirmed||2011.05.21 09:52:49
  Component|other   |rtl-optimization
 Ever Confirmed|0   |1

--- Comment #1 from Richard Guenther rguenth at gcc dot gnu.org 2011-05-21 
09:52:49 UTC ---
Confirmed (not using decq is because it is slower for some archs).  On the
tree level we cannot do better than

  D.2722_2 = *argv_1(D);
  D.2723_3 = D.2722_2 + -1;
  *argv_1(D) = D.2723_3;
  if (D.2723_3 == 0B)

because we lack anything like direct operations on memory (and that's good).
On the RTL side combine tries to do

Trying 7, 8 - 9:
Failed to match this instruction:
(parallel [
(set (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64])
(plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8
A64])
(const_int -1 [0x])))
(set (reg/f:DI 60 [ D.2723 ])
(plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8
A64])
(const_int -1 [0x])))
])

because we have a use of the decrement result in the comparison.  It doesn't
try to combine this with the comparison though.

So this case is really special ;)  Without the use of the decremented
value we get the desired subq $1, (%rsi).

Manually sinking the store to *argv into the if and the else yields

movq(%rsi), %rbx
subq$1, %rbx
je  L4
L2:
movq%rbx, (%rsi)
...

L4:
LCFI4:
movq$0, (%rsi)
movq%rsi, %rdi
movq%rsi, 8(%rsp)
call_fncall

so at least we then can combine the decrement with the test ...

As usual combine doesn't like stores.


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-21 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #2 from Linus Torvalds torva...@linux-foundation.org 2011-05-21 
18:41:15 UTC ---
(In reply to comment #1)

 On the RTL side combine tries to do
 
 Trying 7, 8 - 9:
 Failed to match this instruction:
 (parallel [
 (set (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64])
 (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8
 A64])
 (const_int -1 [0x])))
 (set (reg/f:DI 60 [ D.2723 ])
 (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8
 A64])
 (const_int -1 [0x])))
 ])
 
 because we have a use of the decrement result in the comparison.  It doesn't
 try to combine this with the comparison though.

Why isn't there a trivial pattern for the combination of add+cmp0? It sounds
like a peephole optimization to me.

 So this case is really special ;)  Without the use of the decremented
 value we get the desired subq $1, (%rsi).

The whole notion of decrement and check if zero is just about as special as
mud. 

And I realize that without the check if zero part I get the single rmw
instruction, but I was really hoping that gcc would get this kind of really
obvious code right. There is absolutely no question about what the correct
result is, and gcc simply doesn't generate it.

I'm used to gcc sometimes being confused by more complicated things (inline
asms, bitfields etc), but this is really basic code.

The load-store model is fine for a Pentium 4 - those things were not very good
at complex instructions. But it generates horribly big code, and modern x86
chips all want the operate on memory version.

 Manually sinking the store to *argv into the if and the else yields

Yeah. And that's pretty horrible. 

 As usual combine doesn't like stores.

Is there some reason this can't just be a peephole pattern?

I really thought that gcc has done this before. 

   Linus


[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test

2011-05-21 Thread torva...@linux-foundation.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095

--- Comment #3 from Linus Torvalds torva...@linux-foundation.org 2011-05-21 
20:42:26 UTC ---
Hmm. Looking at that code generation, it strikes me that even with the odd load
store situation, why do we have that test instruction?

   c:8b 10mov(%eax),%edx
   e:83 ea 01 sub$0x1,%edx
  11:85 d2test   %edx,%edx
  13:89 10mov%edx,(%eax)
  15:74 09je 20 main+0x20

iow, regardless of any complexities of the store, that sub + test is just
odd. Gcc knows to simplify that particular sequence in other situations, why
doesn't it simplify it here?

IOW, I can make gcc generate code like

   c:83 e8 01 sub$0x1,%eax
   f:75 07jne18 main+0x18

with no real problem when it's in registers. No test instruction after the
sub. Why does that store matter so much?

It looks like the combine is bring driven by the conditional branch, and then
when the previous instruction from the conditional branch is that store,
everything kind of goes to hell.

Would it be possible to have a peephole for the arithmetic/logical +
compare-with-zero case (causing us to just drop the compare), and then have a
separate peephole optimization that triggers the load + op + store with dead
reg and turns that into a op to mem case?

The MD files do make me confused, so maybe there is some fundamental limitation
to the peephole patterns that makes this impossible?