Re: [PATCH] Try to coalesce for unary and binary ops
On 04/17/14 05:38, Richard Biener wrote: The patch below increases the number of coalescs we attempt to also cover unary and binary operations. This improves initial code generation for code like [ ... ] Expansion has special code to improve coalescing of op1 with target thus this is what we try to match here. Overall there are positive and negative size effects during a bootstrap on x86_64, but overall it seems to be a loss - stage3 cc1 text size is 18261647 bytes without the patch compared to 18265751 bytes with the patch. Now the question is what does this tell us? Not re-using the same pseudo as op and target is always better? Coalescing has always been and probably always will be a transformation that can help or hurt code. At the level where you're applying it I suspect there aren't any really good heuristics to determine when it's likely profitable vs not profitable. I'd tend to lean towards coalescing is good here, but would want to investigate the regressions a bit to see if there's something systematic going on that causes us to generate poor code in some small number of situations that we might be able to fix. You might even initially restrict to just coalescing on unary operators until we have better heuristics on binary operators. Also note all that may end up hurting risc targets. Jeff
Re: [PATCH] Try to coalesce for unary and binary ops
Hi, On Fri, 18 Apr 2014, Steven Bosscher wrote: IMHO TER should be improved to *do* disturb the order of the incoming instructions, to reduce register pressure. The latter is the goal, yes. But TER isn't really the right place for that (it's constrained by too many invariants, running after coalescing and caring only for single-use SSA names e.g). So the alternative would be to have something which is designed for nothing else than reducing register pressure on gimple, and making TER not destroy such schedule. Ciao, Michael.
Re: [PATCH] Try to coalesce for unary and binary ops
Now the question is what does this tell us? Not re-using the same pseudo as op and target is always better? I think that generally better is the most appropriate wording, there is even a specific pass (web.c) to that effect. -- Eric Botcazou
Re: [PATCH] Try to coalesce for unary and binary ops
On Thu, Apr 17, 2014 at 4:00 PM, Michael Matz wrote: And to have sth that TER not immediately un-does we have to disable TER which conveniently happens for coalesced SSA names. So, instead TER should be improved to not disturb the incoming instruction order (except where secondary effects of expanding larger trees can be had). Changing the coalescing set to disable some bad parts in a later pass doesn't sound very convincing :) IMHO TER should be improved to *do* disturb the order of the incoming instructions, to reduce register pressure. There are test cases I've looked at (pathological cases, I'll admit) where TER forwarded loads to stores and blew up register pressure. Alternatively: Do what has to be done to enable sched1 for ix86/x86_64... Ciao! Steven
Re: [PATCH] Try to coalesce for unary and binary ops
On April 18, 2014 1:30:59 PM CEST, Steven Bosscher stevenb@gmail.com wrote: On Thu, Apr 17, 2014 at 4:00 PM, Michael Matz wrote: And to have sth that TER not immediately un-does we have to disable TER which conveniently happens for coalesced SSA names. So, instead TER should be improved to not disturb the incoming instruction order (except where secondary effects of expanding larger trees can be had). Changing the coalescing set to disable some bad parts in a later pass doesn't sound very convincing :) IMHO TER should be improved to *do* disturb the order of the incoming instructions, to reduce register pressure. There are test cases I've looked at (pathological cases, I'll admit) where TER forwarded loads to stores and blew up register pressure. I am looking at doing sth like that by scheduling gimple stmts to that effect before RTL expansion. But TER undos most of the effort unfortunately. TER itself won't be able to perform scheduling due to what it is designed to do (simulate RTL expansion from GENERIC). Richard. Alternatively: Do what has to be done to enable sched1 for ix86/x86_64... Ciao! Steven
[PATCH] Try to coalesce for unary and binary ops
The patch below increases the number of coalescs we attempt to also cover unary and binary operations. This improves initial code generation for code like int foo (int i, int j, int k, int l) { int res = i; res += j; res += k; res += l; return res; } from ;; res_3 = i_1(D) + j_2(D); (insn 9 8 0 (parallel [ (set (reg/v:SI 83 [ res ]) (plus:SI (reg/v:SI 87 [ i ]) (reg/v:SI 88 [ j ]))) (clobber (reg:CC 17 flags)) ]) t.c:4 -1 (nil)) ;; res_5 = res_3 + k_4(D); (insn 10 9 0 (parallel [ (set (reg/v:SI 84 [ res ]) (plus:SI (reg/v:SI 83 [ res ]) (reg/v:SI 89 [ k ]))) (clobber (reg:CC 17 flags)) ]) t.c:5 -1 (nil)) ... to ;; res_3 = i_1(D) + j_2(D); (insn 9 8 0 (parallel [ (set (reg/v:SI 83 [ res ]) (plus:SI (reg/v:SI 85 [ i ]) (reg/v:SI 86 [ j ]))) (clobber (reg:CC 17 flags)) ]) t.c:4 -1 (nil)) ;; res_5 = res_3 + k_4(D); (insn 10 9 0 (parallel [ (set (reg/v:SI 83 [ res ]) (plus:SI (reg/v:SI 83 [ res ]) (reg/v:SI 87 [ k ]))) (clobber (reg:CC 17 flags)) ]) t.c:5 -1 (nil)) re-using the same pseudo for the LHS. Expansion has special code to improve coalescing of op1 with target thus this is what we try to match here. Overall there are positive and negative size effects during a bootstrap on x86_64, but overall it seems to be a loss - stage3 cc1 text size is 18261647 bytes without the patch compared to 18265751 bytes with the patch. Now the question is what does this tell us? Not re-using the same pseudo as op and target is always better? Btw, I tried this to find a convincing metric for a intra-BB scheduling pass (during out-of-SSA) on GIMPLE (to be able to kill that odd scheduling code we now have in reassoc). And to have sth that TER not immediately un-does we have to disable TER which conveniently happens for coalesced SSA names. Thus - schedule for register pressure, and thus reduce SSA name lifetime - with the goal that out-of-SSA can do more coalescing. But it won't even try to coalesce anything else than PHI copies (not affected by scheduling) or plain SSA name copies (shouldn't happen anyway due to copy propagation). So - any ideas? Or is the overall negative for cc1 just an artifact to ignore and we _should_ coalesce as much as possible (even if it doesn't avoid copies - thus the cost of 0 used in the patch)? Otherwise the patch bootstraps and tests fine on x86_64-unknown-linux-gnu. Thanks, Richard. 2014-04-17 Richard Biener rguent...@suse.de * tree-ssa-coalesce.c (create_outofssa_var_map): Try to coalesce SSA name uses with SSA name results in all unary and binary operations. Index: gcc/tree-ssa-coalesce.c === *** gcc/tree-ssa-coalesce.c (revision 209469) --- gcc/tree-ssa-coalesce.c (working copy) *** create_outofssa_var_map (coalesce_list_p *** 991,1007 case GIMPLE_ASSIGN: { tree lhs = gimple_assign_lhs (stmt); tree rhs1 = gimple_assign_rhs1 (stmt); ! if (gimple_assign_ssa_name_copy_p (stmt) gimple_can_coalesce_p (lhs, rhs1)) { v1 = SSA_NAME_VERSION (lhs); v2 = SSA_NAME_VERSION (rhs1); ! cost = coalesce_cost_bb (bb); ! add_coalesce (cl, v1, v2, cost); bitmap_set_bit (used_in_copy, v1); bitmap_set_bit (used_in_copy, v2); } } break; --- 993,1031 case GIMPLE_ASSIGN: { tree lhs = gimple_assign_lhs (stmt); + if (TREE_CODE (lhs) != SSA_NAME) + break; + + /* Expansion handles target == op1 properly and also + target == op2 for commutative binary ops. */ tree rhs1 = gimple_assign_rhs1 (stmt); ! enum tree_code code = gimple_assign_rhs_code (stmt); ! enum gimple_rhs_class klass = get_gimple_rhs_class (code); ! if (TREE_CODE (rhs1) == SSA_NAME gimple_can_coalesce_p (lhs, rhs1)) { v1 = SSA_NAME_VERSION (lhs); v2 = SSA_NAME_VERSION (rhs1); ! add_coalesce (cl, v1, v2, ! klass == GIMPLE_SINGLE_RHS ! ? coalesce_cost_bb (bb) : 0); bitmap_set_bit (used_in_copy, v1); bitmap_set_bit (used_in_copy, v2); } + if (klass == GIMPLE_BINARY_RHS +commutative_tree_code (code)) + { +
Re: [PATCH] Try to coalesce for unary and binary ops
On Thu, 17 Apr 2014, Richard Biener wrote: The patch below increases the number of coalescs we attempt to also cover unary and binary operations. This improves initial code generation for code like int foo (int i, int j, int k, int l) { int res = i; res += j; res += k; res += l; return res; } from ;; res_3 = i_1(D) + j_2(D); (insn 9 8 0 (parallel [ (set (reg/v:SI 83 [ res ]) (plus:SI (reg/v:SI 87 [ i ]) (reg/v:SI 88 [ j ]))) (clobber (reg:CC 17 flags)) ]) t.c:4 -1 (nil)) ;; res_5 = res_3 + k_4(D); (insn 10 9 0 (parallel [ (set (reg/v:SI 84 [ res ]) (plus:SI (reg/v:SI 83 [ res ]) (reg/v:SI 89 [ k ]))) (clobber (reg:CC 17 flags)) ]) t.c:5 -1 (nil)) ... to ;; res_3 = i_1(D) + j_2(D); (insn 9 8 0 (parallel [ (set (reg/v:SI 83 [ res ]) (plus:SI (reg/v:SI 85 [ i ]) (reg/v:SI 86 [ j ]))) (clobber (reg:CC 17 flags)) ]) t.c:4 -1 (nil)) ;; res_5 = res_3 + k_4(D); (insn 10 9 0 (parallel [ (set (reg/v:SI 83 [ res ]) (plus:SI (reg/v:SI 83 [ res ]) (reg/v:SI 87 [ k ]))) (clobber (reg:CC 17 flags)) ]) t.c:5 -1 (nil)) re-using the same pseudo for the LHS. Expansion has special code to improve coalescing of op1 with target thus this is what we try to match here. Overall there are positive and negative size effects during a bootstrap on x86_64, but overall it seems to be a loss - stage3 cc1 text size is 18261647 bytes without the patch compared to 18265751 bytes with the patch. Now the question is what does this tell us? Not re-using the same pseudo as op and target is always better? Btw, I tried this to find a convincing metric for a intra-BB scheduling pass (during out-of-SSA) on GIMPLE (to be able to kill that odd scheduling code we now have in reassoc). And to have sth that TER not immediately un-does we have to disable TER which conveniently happens for coalesced SSA names. Thus - schedule for register pressure, and thus reduce SSA name lifetime - with the goal that out-of-SSA can do more coalescing. But it won't even try to coalesce anything else than PHI copies (not affected by scheduling) or plain SSA name copies (shouldn't happen anyway due to copy propagation). So - any ideas? Or is the overall negative for cc1 just an artifact to ignore and we _should_ coalesce as much as possible (even if it doesn't avoid copies - thus the cost of 0 used in the patch)? One example where it delivers bad initial expansion on x86_64 is int foo (int *p) { int res = p[0]; res += p[1]; res += p[2]; res += p[3]; return res; } where i386.c:ix86_fixup_binary_operands tries to be clever and improve address combine, generating two instructions for (plus:SI (reg/v:SI 83 [ res ]) (mem:SI (...))) and thus triggering expand_binop_directly pat = maybe_gen_insn (icode, 3, ops); if (pat) { /* If PAT is composed of more than one insn, try to add an appropriate REG_EQUAL note to it. If we can't because TEMP conflicts with an operand, call expand_binop again, this time without a target. */ if (INSN_P (pat) NEXT_INSN (pat) != NULL_RTX ! add_equal_note (pat, ops[0].value, optab_to_code (binoptab), ops[1].value, ops[2].value)) { delete_insns_since (last); return expand_binop (mode, binoptab, op0, op1, NULL_RTX, unsignedp, methods); } and thus we end up with (insn 9 6 10 (set (reg:SI 91) (mem:SI (plus:DI (reg/v/f:DI 88 [ p ]) (const_int 4 [0x4])) [0 MEM[(int *)p_2(D) + 4B]+0 S4 A32])) t.c:4 -1 (nil)) (insn 10 9 11 (parallel [ (set (reg:SI 90) (plus:SI (reg/v:SI 83 [ res ]) (reg:SI 91))) (clobber (reg:CC 17 flags)) ]) t.c:4 -1 (expr_list:REG_EQUAL (plus:SI (reg/v:SI 83 [ res ]) (mem:SI (plus:DI (reg/v/f:DI 88 [ p ]) (const_int 4 [0x4])) [0 MEM[(int *)p_2(D) + 4B]+0 S4 A32])) (nil))) (insn 11 10 0 (set (reg/v:SI 83 [ res ]) (reg:SI 90)) t.c:4 -1 (nil)) unpatched we avoid the last move (the tiny testcase of course ends up optimizing the same anyway). Not sure if that strong desire to add a REG_EQUAL note makes up for the losses. At least it looks backwards to the code preceeding it: /* If operation is commutative, try to make the first operand a register. Even better, try to make it the same as the target. Also try to make the last operand a constant. */ if (commutative_p swap_commutative_operands_with_target (target, xop0, xop1)) { swap = xop1; xop1 = xop0; xop0 = swap; }
Re: [PATCH] Try to coalesce for unary and binary ops
Hi, On Thu, 17 Apr 2014, Richard Biener wrote: The patch below increases the number of coalescs we attempt to also cover unary and binary operations. This is not usually a good idea if not mitigated by things like register pressure measurement and using target properties to determine if it's a two- or three-address instruction. It increases register pressure and naturally generates multiple-def pseudos which aren't liked by some of the RTL passes. It will lead to fewer pseudos, so there's a positive side. Now the question is what does this tell us? Not re-using the same pseudo as op and target is always better? No, it tells us that tree-ssa-coalesce is too early for such coalescing. The register allocator is the right spot (or instruction selection if we had that), and it's done there. And to have sth that TER not immediately un-does we have to disable TER which conveniently happens for coalesced SSA names. So, instead TER should be improved to not disturb the incoming instruction order (except where secondary effects of expanding larger trees can be had). Changing the coalescing set to disable some bad parts in a later pass doesn't sound very convincing :) Ciao, Michael.