[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #11 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-29 18:51:51 UTC --- Author: jakub Date: Sun May 29 18:51:48 2011 New Revision: 174413 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=174413 Log: PR rtl-optimization/49095 * config/i386/predicates.md (plusminuslogic_operator): New predicate. * config/i386/i386.md: Add peepholes for mem {+,-,,|,^}= x; mem != 0. * gcc.target/i386/pr49095.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr49095.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.md trunk/gcc/config/i386/predicates.md trunk/gcc/testsuite/ChangeLog
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 Jakub Jelinek jakub at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution||FIXED --- Comment #12 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-29 18:52:43 UTC --- Fixed.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #13 from Linus Torvalds torva...@linux-foundation.org 2011-05-29 18:56:44 UTC --- Thanks guys.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 Jakub Jelinek jakub at gcc dot gnu.org changed: What|Removed |Added Status|NEW |ASSIGNED AssignedTo|unassigned at gcc dot |jakub at gcc dot gnu.org |gnu.org | --- Comment #4 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 10:35:52 UTC --- Created attachment 24366 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=24366 gcc47-pr49095.patch Untested patch which solves this (for some cases) using peephole2. The reason combiner can't do this is that there are no LOG_LINKS in between the comparison and memory store, those two insns are independent and thus don't ever show up together in the same try_combine call (together with their dependencies). And we generate: movq(%rsi), %rax subq$1, %rax testq%rax, %rax movq%rax, (%rsi) je.L4 instead of movq(%rsi), %rax subq$1, %rax movq%rax, (%rsi) je.L4 because in case of multiple uses, LOG_LINKS are on the first use, and the memory store is before the test during combine (only later scheduling changes it). Thus the test has no LOG_LINKS at all and isn't being attempted to be merged together with the subtraction.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 Jakub Jelinek jakub at gcc dot gnu.org changed: What|Removed |Added Attachment #24366|0 |1 is obsolete|| --- Comment #5 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 11:58:37 UTC --- Created attachment 24368 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=24368 gcc47-pr49095.patch Improved patch with two new peephole2 patterns, which in addition to the earlier: reg0 = mem1; reg0 op= nonmem2; mem1 = reg0; if (reg0 != 0) optimizes also: reg0 op= mem1; mem1 = reg0; if (reg0 != 0) and variant of the first, where all the operations are in QI or HImode, except for op= which is in SImode. Again, the patch has been tested just on this testcase so far.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #6 from Linus Torvalds torva...@linux-foundation.org 2011-05-27 14:15:25 UTC --- Jakub - the patch looks sane, although I don't currently have a gcc build environment to actually test it with, and there is no way I'm going to claim that I follow all the matches and understand why you have that third pattern with SWI12 (but I'm guessing it's because the op and the test are being done in the explicit cast to integer mode). One thing I wanted to check, though: I hope everybody is aware that op $x,mem is not identically the same as mov mem,reg op $x, reg mov reg,mem test reg WRT the carry flag. The test version will always clear the carry flag, while the op version will obviously set it according to the op. For the logical ops, this isn't a problem (they also clear carry), but for add and sub this transformation *will* result in a difference in the C flag. Now, carry is meaningless when talking about compare with zero, so this should all be trivially ok. But I bring it up because it is possible that gcc perhaps still uses the unsigned compare versions with zero. In particular, the thing to look out for would be code like unsigned int *a; if (--*a 0) do_something(); and if the comparison uses jg (unsigned greater than) for the comparison, it will get the wrong end result. Now, unsigned compares against zero are always expressible as ne or eq, so this is not a fundamental problem. In fact, my trivial testing with a few cases seems to imply that gcc already does that conversion to inequality, and you'd obviously have exactly the same issue with eliding the test instruction for the cases you already handle (when things are in registers). So I think everything is fine - but I wanted to mention it explicitly in case it makes you go ooh, yes, there are cases when this is an issue
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #7 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 14:47:08 UTC --- (In reply to comment #6) Jakub - the patch looks sane, although I don't currently have a gcc build environment to actually test it with, and there is no way I'm going to claim that I follow all the matches and understand why you have that third pattern with SWI12 (but I'm guessing it's because the op and the test are being done in the explicit cast to integer mode). E.g. in the testcase from the patch, in hcharplus routine peephole2 sees: (insn 27 6 28 2 (set (reg:QI 0 ax [orig:62 D.3491 ] [62]) (mem/c:QI (reg/v/f:DI 3 bx [orig:64 x ] [64]) [0 *x_1(D)+0 S1 A8])) pr49095.c:63 66 {*movqi_internal} (nil)) (insn 28 27 8 2 (parallel [ (set (reg:SI 0 ax [orig:62 D.3491 ] [62]) (plus:SI (reg:SI 0 ax [orig:62 D.3491 ] [62]) (const_int 24 [0x18]))) (clobber (reg:CC 17 flags)) ]) pr49095.c:63 249 {*addsi_1} (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))) (insn 8 28 9 2 (set (mem:QI (reg/v/f:DI 3 bx [orig:64 x ] [64]) [0 *x_1(D)+0 S1 A8]) (reg:QI 0 ax [orig:62 D.3491 ] [62])) pr49095.c:63 66 {*movqi_internal} (nil)) (insn 9 8 10 2 (set (reg:CCZ 17 flags) (compare:CCZ (reg:QI 0 ax [orig:62 D.3491 ] [62]) (const_int 0 [0]))) pr49095.c:63 0 {*cmpqi_ccno_1} (expr_list:REG_DEAD (reg:QI 0 ax [orig:62 D.3491 ] [62]) (nil))) It used to be normal QImode addition addqi_1_lea, but it got later on split to make the code shorter: ;; Avoid redundant prefixes by splitting HImode arithmetic to SImode. (define_split [(set (match_operand 0 register_operand ) (match_operator 3 promotable_binary_operator [(match_operand 1 register_operand ) (match_operand 2 aligned_operand )])) (clobber (reg:CC FLAGS_REG))] ! TARGET_PARTIAL_REG_STALL reload_completed ((GET_MODE (operands[0]) == HImode ((optimize_function_for_speed_p (cfun) !TARGET_FAST_PREFIX) /* ??? next two lines just !satisfies_constraint_K (...) */ || !CONST_INT_P (operands[2]) || satisfies_constraint_K (operands[2]))) || (GET_MODE (operands[0]) == QImode (TARGET_PROMOTE_QImode || optimize_function_for_size_p (cfun [(parallel [(set (match_dup 0) (match_op_dup 3 [(match_dup 1) (match_dup 2)])) (clobber (reg:CC FLAGS_REG))])] operands[0] = gen_lowpart (SImode, operands[0]); operands[1] = gen_lowpart (SImode, operands[1]); if (GET_CODE (operands[3]) != ASHIFT) operands[2] = gen_lowpart (SImode, operands[2]); PUT_MODE (operands[3], SImode);) One thing I wanted to check, though: I hope everybody is aware that op $x,mem is not identically the same as mov mem,reg op $x, reg mov reg,mem test reg WRT the carry flag. The test version will always clear the carry flag, while the op version will obviously set it according to the op. For the logical ops, this isn't a problem (they also clear carry), but for add and sub this transformation *will* result in a difference in the C flag. Now, carry is meaningless when talking about compare with zero, so this should all be trivially ok. But I bring it up because it is possible that gcc perhaps still uses the unsigned compare versions with zero. In particular, the thing to look out for would be code like unsigned int *a; if (--*a 0) do_something(); and if the comparison uses jg (unsigned greater than) for the comparison, it will get the wrong end result. Now, unsigned compares against zero are always expressible as ne or eq, so this is not a fundamental problem. In fact, my trivial testing with a few cases seems to imply that gcc already does that conversion to inequality, and you'd obviously have exactly the same issue with eliding the test instruction for the cases you already handle (when things are in registers). That ought to be handled by the ix86_match_ccmode tests in the peepholes, using CCGOCmode for PLUS/MINUS and CCNOmode for AND/IOR/XOR. I've just copied what the actual patterns these peepholes will turn into use. The documentation about those modes says: Add CCNO to indicate comparisons against zero that requires Overflow flag to be unset. Sign bit test is used instead and thus can be used to form ab0 type of tests. Add CCGOC to indicate comparisons against zero that allows unspecified garbage in the Carry and Overflow flag. This mode is used to simulate comparisons of (a-b) and (a+b) against zero using sub/cmp/add operations. and ix86_match_ccmode should check if the test type in which the flags register is tested uses only the flags that will be set correctly by the operation. In your testcase the comparison was done in
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #8 from Linus Torvalds torva...@linux-foundation.org 2011-05-27 15:32:21 UTC --- (In reply to comment #7) BTW, the patch bootstrapped/regtested on both x86_64-linux and i686-linux, I'm just running second set of bootstrap without the patch to be able to compare how much the patch changed code on gcc itself. I'd love to hear how well it works - I'm in the middle of a busy merge window and about to leave for a trip to Japan, otherwise I'd just try to set up a gcc build myself right now. (Actually, I have a copy of a git tree, but that one fails with /usr/bin/ld: ../libiberty/libiberty.a(argv.o): relocation R_X86_64_32S against `_sch_istable' can not be used when making a shared object; recompile with -fPIC and due to the merge window I don't really have time to try to figure it out) Thanks, Linus
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #9 from Jakub Jelinek jakub at gcc dot gnu.org 2011-05-27 16:33:33 UTC --- .text sizes before/after the patch (in each case on identical sources, for cc1plus I've reverted the patch afterwards and did make -j64 cc1plus in gcc/ subdir), the size changes aren't that big, but perhaps other projects use it more often and/or something is hidden due to alignment. 32-bit before32-bit after64-bit before64-bit after libstdc++.so.60x717080x716e80x67ee60x67ec6 libgcj.so.120xa3ec580xa3eb980x90cd680x90cce8 cc1plus0xe8a29c0xe8a25c0xdccf980xdccf08 In 64-bit cc1plus it only made a difference in: c-decl.o cfg.o dbxout.o ebitmap.o final.o ira-color.o jump.o regcprop.o reg-stack.o reload.o tree-ssa-dom.o tree-ssa-loop-ivopts.o Anyway, I'll post the patch now and see what Uros says about it.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #10 from Linus Torvalds torva...@linux-foundation.org 2011-05-27 16:48:52 UTC --- (In reply to comment #9) 32-bit before32-bit after64-bit before64-bit after libstdc++.so.60x717080x716e80x67ee60x67ec6 libgcj.so.120xa3ec580xa3eb980x90cd680x90cce8 cc1plus0xe8a29c0xe8a25c0xdccf980xdccf08 Ok, that's much less noticeable than I was hoping for. That said, even for the kernel, the only reason I noticed this problem was not because I've seen a lot of them per se, but because the pattern showed up in a very hot function. In fact, it's the same __rcu_read_unlock() function that I made the otherwise unrelated bugzilla entry for inlining: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194 so it's quite possible that we don't have that many of them in the kernel either.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 Richard Guenther rguenth at gcc dot gnu.org changed: What|Removed |Added Target||x86_64-*-*, i?86-*-* Status|UNCONFIRMED |NEW Keywords||missed-optimization Last reconfirmed||2011.05.21 09:52:49 Component|other |rtl-optimization Ever Confirmed|0 |1 --- Comment #1 from Richard Guenther rguenth at gcc dot gnu.org 2011-05-21 09:52:49 UTC --- Confirmed (not using decq is because it is slower for some archs). On the tree level we cannot do better than D.2722_2 = *argv_1(D); D.2723_3 = D.2722_2 + -1; *argv_1(D) = D.2723_3; if (D.2723_3 == 0B) because we lack anything like direct operations on memory (and that's good). On the RTL side combine tries to do Trying 7, 8 - 9: Failed to match this instruction: (parallel [ (set (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64]) (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64]) (const_int -1 [0x]))) (set (reg/f:DI 60 [ D.2723 ]) (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64]) (const_int -1 [0x]))) ]) because we have a use of the decrement result in the comparison. It doesn't try to combine this with the comparison though. So this case is really special ;) Without the use of the decremented value we get the desired subq $1, (%rsi). Manually sinking the store to *argv into the if and the else yields movq(%rsi), %rbx subq$1, %rbx je L4 L2: movq%rbx, (%rsi) ... L4: LCFI4: movq$0, (%rsi) movq%rsi, %rdi movq%rsi, 8(%rsp) call_fncall so at least we then can combine the decrement with the test ... As usual combine doesn't like stores.
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #2 from Linus Torvalds torva...@linux-foundation.org 2011-05-21 18:41:15 UTC --- (In reply to comment #1) On the RTL side combine tries to do Trying 7, 8 - 9: Failed to match this instruction: (parallel [ (set (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64]) (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64]) (const_int -1 [0x]))) (set (reg/f:DI 60 [ D.2723 ]) (plus:DI (mem/f:DI (reg/v/f:DI 63 [ argv ]) [2 *argv_1(D)+0 S8 A64]) (const_int -1 [0x]))) ]) because we have a use of the decrement result in the comparison. It doesn't try to combine this with the comparison though. Why isn't there a trivial pattern for the combination of add+cmp0? It sounds like a peephole optimization to me. So this case is really special ;) Without the use of the decremented value we get the desired subq $1, (%rsi). The whole notion of decrement and check if zero is just about as special as mud. And I realize that without the check if zero part I get the single rmw instruction, but I was really hoping that gcc would get this kind of really obvious code right. There is absolutely no question about what the correct result is, and gcc simply doesn't generate it. I'm used to gcc sometimes being confused by more complicated things (inline asms, bitfields etc), but this is really basic code. The load-store model is fine for a Pentium 4 - those things were not very good at complex instructions. But it generates horribly big code, and modern x86 chips all want the operate on memory version. Manually sinking the store to *argv into the if and the else yields Yeah. And that's pretty horrible. As usual combine doesn't like stores. Is there some reason this can't just be a peephole pattern? I really thought that gcc has done this before. Linus
[Bug rtl-optimization/49095] Horrible code generation for trivial decrement with test
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49095 --- Comment #3 from Linus Torvalds torva...@linux-foundation.org 2011-05-21 20:42:26 UTC --- Hmm. Looking at that code generation, it strikes me that even with the odd load store situation, why do we have that test instruction? c:8b 10mov(%eax),%edx e:83 ea 01 sub$0x1,%edx 11:85 d2test %edx,%edx 13:89 10mov%edx,(%eax) 15:74 09je 20 main+0x20 iow, regardless of any complexities of the store, that sub + test is just odd. Gcc knows to simplify that particular sequence in other situations, why doesn't it simplify it here? IOW, I can make gcc generate code like c:83 e8 01 sub$0x1,%eax f:75 07jne18 main+0x18 with no real problem when it's in registers. No test instruction after the sub. Why does that store matter so much? It looks like the combine is bring driven by the conditional branch, and then when the previous instruction from the conditional branch is that store, everything kind of goes to hell. Would it be possible to have a peephole for the arithmetic/logical + compare-with-zero case (causing us to just drop the compare), and then have a separate peephole optimization that triggers the load + op + store with dead reg and turns that into a op to mem case? The MD files do make me confused, so maybe there is some fundamental limitation to the peephole patterns that makes this impossible?