Re: [PATCH] Try to coalesce for unary and binary ops

2014-05-02 Thread Jeff Law

On 04/17/14 05:38, Richard Biener wrote:


The patch below increases the number of coalesces we attempt,
to also cover unary and binary operations.  This improves
initial code generation for code like

[ ... ]



Expansion has special code to improve coalescing of op1 with the
target, so this is what we try to match here.

There are both positive and negative size effects during
a bootstrap on x86_64, but overall it seems to be a loss
- stage3 cc1 text size is 18261647 bytes without the patch
compared to 18265751 bytes with the patch.

Now the question is: what does this tell us?  That not re-using
the same pseudo as op and target is always better?
Coalescing has always been and probably always will be a transformation
that can help or hurt code.  At the level where you're applying it, I
suspect there aren't any really good heuristics to determine when it's
likely to be profitable and when it isn't.


I'd tend to lean towards coalescing being good here, but I would want to
investigate the regressions a bit to see if there's something systematic
going on that causes us to generate poor code in a small number of
situations that we might be able to fix.


You might even initially restrict this to coalescing on unary operators
until we have better heuristics for binary operators.


Also note that all of this may end up hurting RISC targets.

Jeff


Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-22 Thread Michael Matz
Hi,

On Fri, 18 Apr 2014, Steven Bosscher wrote:

 IMHO TER should be improved to *do* disturb the order of the incoming
 instructions, to reduce register pressure.

The latter is the goal, yes.  But TER isn't really the right place for
that (it's constrained by too many invariants, running after coalescing
and caring only for single-use SSA names, for example).  So the alternative
would be to have something designed for nothing other than reducing
register pressure on gimple, and to make TER not destroy such a schedule.


Ciao,
Michael.


Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-18 Thread Eric Botcazou
 Now the question is: what does this tell us?  That not re-using
 the same pseudo as op and target is always better?

I think that "generally better" is the most appropriate wording; there is even
a specific pass (web.c) to that effect.

-- 
Eric Botcazou


Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-18 Thread Steven Bosscher
On Thu, Apr 17, 2014 at 4:00 PM, Michael Matz wrote:
 And to have sth that TER not immediately undoes we have
 to disable TER, which conveniently happens for coalesced
 SSA names.

 So, instead TER should be improved to not disturb the incoming instruction
 order (except where secondary effects of expanding larger trees can be
 had).  Changing the coalescing set to disable some bad parts in a later
 pass doesn't sound very convincing :)

IMHO TER should be improved to *do* disturb the order of the incoming
instructions, to reduce register pressure. There are test cases I've
looked at (pathological cases, I'll admit) where TER forwarded loads
to stores and blew up register pressure.

Alternatively: Do what has to be done to enable sched1 for ix86/x86_64...

Ciao!
Steven


Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-18 Thread Richard Biener
On April 18, 2014 1:30:59 PM CEST, Steven Bosscher stevenb@gmail.com wrote:
On Thu, Apr 17, 2014 at 4:00 PM, Michael Matz wrote:
 And to have sth that TER not immediately undoes we have
 to disable TER, which conveniently happens for coalesced
 SSA names.

 So, instead TER should be improved to not disturb the incoming
 instruction order (except where secondary effects of expanding larger
 trees can be had).  Changing the coalescing set to disable some bad
 parts in a later pass doesn't sound very convincing :)

IMHO TER should be improved to *do* disturb the order of the incoming
instructions, to reduce register pressure. There are test cases I've
looked at (pathological cases, I'll admit) where TER forwarded loads
to stores and blew up register pressure.

I am looking at doing something like that, scheduling gimple stmts to that
effect before RTL expansion.  But TER unfortunately undoes most of the effort.
TER itself won't be able to perform scheduling due to what it is designed to
do (simulate RTL expansion from GENERIC).

Richard.

Alternatively: Do what has to be done to enable sched1 for ix86/x86_64...

Ciao!
Steven




[PATCH] Try to coalesce for unary and binary ops

2014-04-17 Thread Richard Biener

The patch below increases the number of coalesces we attempt,
to also cover unary and binary operations.  This improves
initial code generation for code like

int foo (int i, int j, int k, int l)
{
  int res = i;
  res += j;
  res += k;
  res += l;
  return res;
}

from

;; res_3 = i_1(D) + j_2(D);

(insn 9 8 0 (parallel [
(set (reg/v:SI 83 [ res ])
(plus:SI (reg/v:SI 87 [ i ])
(reg/v:SI 88 [ j ])))
(clobber (reg:CC 17 flags))
]) t.c:4 -1
 (nil))

;; res_5 = res_3 + k_4(D);

(insn 10 9 0 (parallel [
(set (reg/v:SI 84 [ res ])
(plus:SI (reg/v:SI 83 [ res ])
(reg/v:SI 89 [ k ])))
(clobber (reg:CC 17 flags))
]) t.c:5 -1
 (nil))
...

to

;; res_3 = i_1(D) + j_2(D);

(insn 9 8 0 (parallel [
(set (reg/v:SI 83 [ res ])
(plus:SI (reg/v:SI 85 [ i ])
(reg/v:SI 86 [ j ])))
(clobber (reg:CC 17 flags))
]) t.c:4 -1
 (nil))

;; res_5 = res_3 + k_4(D);

(insn 10 9 0 (parallel [
(set (reg/v:SI 83 [ res ])
(plus:SI (reg/v:SI 83 [ res ])
(reg/v:SI 87 [ k ])))
(clobber (reg:CC 17 flags))
]) t.c:5 -1
 (nil))

re-using the same pseudo for the LHS.

Expansion has special code to improve coalescing of op1 with the
target, so this is what we try to match here.

There are both positive and negative size effects during
a bootstrap on x86_64, but overall it seems to be a loss
- stage3 cc1 text size is 18261647 bytes without the patch
compared to 18265751 bytes with the patch.

Now the question is: what does this tell us?  That not re-using
the same pseudo as op and target is always better?

Btw, I tried this to find a convincing metric for an intra-BB
scheduling pass (during out-of-SSA) on GIMPLE (to be able
to kill that odd scheduling code we now have in reassoc).
And to have sth that TER not immediately undoes we have
to disable TER, which conveniently happens for coalesced
SSA names.  Thus - schedule for register pressure, and thus
reduce SSA name lifetime - with the goal that out-of-SSA can
do more coalescing.  But it won't even try to coalesce
anything other than PHI copies (not affected by scheduling)
or plain SSA name copies (shouldn't happen anyway due to
copy propagation).
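
As a purely illustrative sketch of that idea (not part of the patch, and
with a made-up statement encoding), a greedy intra-BB scheduler could emit,
among the statements whose operands are already available, the one that ends
the most live ranges; a real pass would of course walk the gimple_seq and the
SSA operand iterators instead of the toy arrays used here:

/* Toy pressure-reducing scheduler for one basic block.  Statement i
   defines value i; values NSTMTS..NVALS-1 are live on entry.  Among
   the "ready" statements (all operands defined), pick the one whose
   emission kills the most currently-live values.  */

#include <stdio.h>

#define NSTMTS 4
#define NVALS  8
#define MAXUSE 2

static const int uses[NSTMTS][MAXUSE] = {
  { 4, 5 },   /* v0 = v4 + v5  */
  { 6, 7 },   /* v1 = v6 + v7  */
  { 0, 6 },   /* v2 = v0 + v6  */
  { 2, 1 },   /* v3 = v2 + v1  */
};

int
main (void)
{
  int scheduled[NSTMTS] = { 0 };
  int defined[NVALS] = { 0 };
  int remaining_uses[NVALS] = { 0 };
  int i, j, n;

  for (i = NSTMTS; i < NVALS; i++)
    defined[i] = 1;                      /* incoming values are available  */
  for (i = 0; i < NSTMTS; i++)
    for (j = 0; j < MAXUSE; j++)
      remaining_uses[uses[i][j]]++;

  for (n = 0; n < NSTMTS; n++)
    {
      int best = -1, best_score = -1000;
      for (i = 0; i < NSTMTS; i++)
        {
          int ready = !scheduled[i], score = 0;
          for (j = 0; j < MAXUSE; j++)
            {
              if (!defined[uses[i][j]])
                ready = 0;               /* operand not yet computed        */
              else if (remaining_uses[uses[i][j]] == 1)
                score++;                 /* last use: the operand dies here */
            }
          if (remaining_uses[i] > 0)
            score--;                     /* the result stays live in the block */
          if (ready && score > best_score)
            {
              best = i;
              best_score = score;
            }
        }
      scheduled[best] = 1;
      defined[best] = 1;
      for (j = 0; j < MAXUSE; j++)
        remaining_uses[uses[best][j]]--;
      printf ("emit stmt %d (score %d)\n", best, best_score);
    }
  return 0;
}

Whether an order produced this way actually enables more coalescing at
out-of-SSA is, of course, the open question here.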

So - any ideas?  Or is the overall negative for cc1 just
an artifact to ignore and we _should_ coalesce as much
as possible (even if it doesn't avoid copies - thus the
cost of 0 used in the patch)?

Otherwise the patch bootstraps and tests fine on x86_64-unknown-linux-gnu.

Thanks,
Richard.

2014-04-17  Richard Biener  &lt;rguent...@suse.de&gt;

* tree-ssa-coalesce.c (create_outofssa_var_map): Try to
coalesce SSA name uses with SSA name results in all unary
and binary operations.

Index: gcc/tree-ssa-coalesce.c
===================================================================
*** gcc/tree-ssa-coalesce.c (revision 209469)
--- gcc/tree-ssa-coalesce.c (working copy)
*** create_outofssa_var_map (coalesce_list_p
*** 991,1007 ****
case GIMPLE_ASSIGN:
  {
tree lhs = gimple_assign_lhs (stmt);
tree rhs1 = gimple_assign_rhs1 (stmt);
!   if (gimple_assign_ssa_name_copy_p (stmt)
 && gimple_can_coalesce_p (lhs, rhs1))
  {
v1 = SSA_NAME_VERSION (lhs);
v2 = SSA_NAME_VERSION (rhs1);
!   cost = coalesce_cost_bb (bb);
!   add_coalesce (cl, v1, v2, cost);
bitmap_set_bit (used_in_copy, v1);
bitmap_set_bit (used_in_copy, v2);
  }
  }
  break;
  
--- 993,1031 ----
case GIMPLE_ASSIGN:
  {
tree lhs = gimple_assign_lhs (stmt);
+   if (TREE_CODE (lhs) != SSA_NAME)
+ break;
+ 
+   /* Expansion handles target == op1 properly and also
+  target == op2 for commutative binary ops.  */
tree rhs1 = gimple_assign_rhs1 (stmt);
!   enum tree_code code = gimple_assign_rhs_code (stmt);
!   enum gimple_rhs_class klass = get_gimple_rhs_class (code);
!   if (TREE_CODE (rhs1) == SSA_NAME
 && gimple_can_coalesce_p (lhs, rhs1))
  {
v1 = SSA_NAME_VERSION (lhs);
v2 = SSA_NAME_VERSION (rhs1);
!   add_coalesce (cl, v1, v2,
! klass == GIMPLE_SINGLE_RHS
! ? coalesce_cost_bb (bb) : 0);
bitmap_set_bit (used_in_copy, v1);
bitmap_set_bit (used_in_copy, v2);
  }
+   if (klass == GIMPLE_BINARY_RHS
+   && commutative_tree_code (code))
+ {
+   

Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-17 Thread Richard Biener
On Thu, 17 Apr 2014, Richard Biener wrote:

 
 The patch below increases the number of coalesces we attempt,
 to also cover unary and binary operations.  This improves
 initial code generation for code like
 
 [ ... ]
 
 So - any ideas?  Or is the overall negative for cc1 just
 an artifact to ignore and we _should_ coalesce as much
 as possible (even if it doesn't avoid copies - thus the
 cost of 0 used in the patch)?

One example where it delivers bad initial expansion on x86_64 is

int foo (int *p)
{
  int res = p[0];
  res += p[1];
  res += p[2];
  res += p[3];
  return res;
}

where i386.c:ix86_fixup_binary_operands tries to be clever
and improve address combine, generating two instructions
for (plus:SI (reg/v:SI 83 [ res ]) (mem:SI (...))) and thus
triggering expand_binop_directly

  pat = maybe_gen_insn (icode, 3, ops);
  if (pat)
{
  /* If PAT is composed of more than one insn, try to add an appropriate
     REG_EQUAL note to it.  If we can't because TEMP conflicts with an
     operand, call expand_binop again, this time without a target.  */
  if (INSN_P (pat) && NEXT_INSN (pat) != NULL_RTX
      && ! add_equal_note (pat, ops[0].value, optab_to_code (binoptab),
                           ops[1].value, ops[2].value))
{
  delete_insns_since (last);
  return expand_binop (mode, binoptab, op0, op1, NULL_RTX,
   unsignedp, methods);
}

and thus we end up with

(insn 9 6 10 (set (reg:SI 91)
(mem:SI (plus:DI (reg/v/f:DI 88 [ p ])
(const_int 4 [0x4])) [0 MEM[(int *)p_2(D) + 4B]+0 S4 
A32])) t.c:4 -1
 (nil))

(insn 10 9 11 (parallel [
(set (reg:SI 90)
(plus:SI (reg/v:SI 83 [ res ])
(reg:SI 91)))
(clobber (reg:CC 17 flags))
]) t.c:4 -1
 (expr_list:REG_EQUAL (plus:SI (reg/v:SI 83 [ res ])
(mem:SI (plus:DI (reg/v/f:DI 88 [ p ])
(const_int 4 [0x4])) [0 MEM[(int *)p_2(D) + 4B]+0 S4 
A32]))
(nil)))

(insn 11 10 0 (set (reg/v:SI 83 [ res ])
(reg:SI 90)) t.c:4 -1
 (nil))

unpatched we avoid the last move (the tiny testcase of course
ends up optimizing to the same code anyway).  I am not sure whether that
strong desire to add a REG_EQUAL note makes up for the losses.  At
least it seems at odds with the code preceding it:

  /* If operation is commutative,
 try to make the first operand a register.
 Even better, try to make it the same as the target.
 Also try to make the last operand a constant.  */
  if (commutative_p
      && swap_commutative_operands_with_target (target, xop0, xop1))
{
  swap = xop1;
  xop1 = xop0;
  xop0 = swap;
}

Re: [PATCH] Try to coalesce for unary and binary ops

2014-04-17 Thread Michael Matz
Hi,

On Thu, 17 Apr 2014, Richard Biener wrote:

 The patch below increases the number of coalesces we attempt,
 to also cover unary and binary operations.

This is not usually a good idea if not mitigated by things like register
pressure measurement and using target properties to determine whether it's a
two- or three-address instruction.  It increases register pressure and
naturally generates multiple-def pseudos, which aren't liked by some of the
RTL passes.  On the other hand it will lead to fewer pseudos, so there's a
positive side.

 Now the question is: what does this tell us?  That not re-using the same
 pseudo as op and target is always better?

No, it tells us that tree-ssa-coalesce is too early for such coalescing.  
The register allocator is the right spot (or instruction selection if we 
had that), and it's done there.

 And to have sth that TER not immediately undoes we have
 to disable TER, which conveniently happens for coalesced
 SSA names.

So, instead TER should be improved to not disturb the incoming instruction 
order (except where secondary effects of expanding larger trees can be 
had).  Changing the coalescing set to disable some bad parts in a later 
pass doesn't sound very convincing :)


Ciao,
Michael.