[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0

2016-03-24 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #14 from Jeffrey A. Law  ---
Some further notes.

I was looking at what the impact would be if we just stopped recording the
problematic equivalences in CSE, both to see whether the equivalences are
useful at all and, if they are, to get a sense of when (which might lead to
some useful conditions for recording them).  I was quite surprised at how much
of a difference these equivalences make to the resulting code generation.

One of the things that is emerging is that these equivalences are useful when
the copy propagation they enable allows one operand to die at the comparison
when it didn't die before.  That in turn may allow creation of a REG_EQUIV note
on the insn that initializes the dying register, which then allows
substitution of the equivalent memory for the register in the comparison.

We still have the same number of memory references, but we use one less
register in that case and have one less instruction.

We obviously don't have that level of insight during CSE.  But given uses/sets
DF info, we can get a lot of the way there.

Anyway, just wanted to record some of my findings.  I'm putting this down for
now as it's not likely to be a gcc-6 kind of change.

2016-03-23 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Jeffrey A. Law  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P2

--- Comment #13 from Jeffrey A. Law  ---
Essentially this is the same problem we have with DOM using context-sensitive
equivalences to copy propagate into subgraphs, but in CSE.  I'm increasingly of
the opinion that the equivalences DOM finds this way should be used for
simplification only, not for copy propagation.  That opinion would apply to CSE
as well.

I'm not sure if we can put the pieces in place for gcc-6, but I think that's
the direction we ought to be going.
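
To make the distinction concrete, here is a minimal Python sketch, not DOM or CSE code.  It assumes a made-up expression form (nested tuples of `(op, lhs, rhs)` with register names as strings) and a hypothetical `simplify` helper.  A context-sensitive equivalence is used only to fold expressions that compare or subtract the two equivalent registers; plain uses of either register are never rewritten, so no live range gets extended into the subgraph.

```python
# Toy model only: the expression form and the helper name are assumptions.
# An equivalence is a frozenset of two register names known to be equal
# in the dominated subgraph.

def simplify(expr, equiv):
    """Fold expressions using the equivalence, without copy propagation."""
    if not isinstance(expr, tuple):
        return expr  # a plain register use (or constant) is left alone
    op, lhs, rhs = expr
    lhs = simplify(lhs, equiv)
    rhs = simplify(rhs, equiv)
    same = lhs == rhs or frozenset((lhs, rhs)) in equiv
    if same and op == "eq":
        return 1
    if same and op in ("ne", "minus"):
        return 0
    return (op, lhs, rhs)


EQUIV = {frozenset(("r116", "r117"))}

print(simplify(("ne", "r117", "r116"), EQUIV))
print(simplify(("plus", "r117", ("minus", "r116", "r117")), EQUIV))
print(simplify("r117", EQUIV))
```

In this model the comparison folds to a constant and the subtraction folds to zero, while the bare use of r117 stays r117 rather than becoming r116.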

The alternative would be to do some kind of range splitting.  What we'd want to
know is whether we have a context-sensitive equivalence and whether splitting
the range in the dominated subgraph would result in a graph that is more
easily/better colorable.  In this case, the subgraph creates all the conflicts
so it's an obvious split point, but I'm not sure how easily we could generalize
that.
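
As a rough illustration of the "obvious split point" test, the following Python sketch checks whether every conflict between a pseudo and the hard register it prefers arises inside the dominated subgraph.  The helper `split_frees_preference` and the `conflict_sites` map are hypothetical; nothing like them is claimed to exist in GCC, and a real heuristic would also have to weigh the cost of the extra copy.

```python
# Hypothetical helper, not a GCC API: decide whether splitting a pseudo's
# live range at the entry of a dominated subgraph would remove every conflict
# between the pseudo and the hard register it prefers.

def split_frees_preference(conflict_sites, pseudo, hard_reg, subgraph):
    """conflict_sites maps frozenset({a, b}) to the blocks where a and b conflict."""
    sites = conflict_sites.get(frozenset((pseudo, hard_reg)), set())
    # Worth splitting only if there is a conflict and all of it lies in the subgraph.
    return bool(sites) and sites <= subgraph


# The case from this PR: r116 wants r0 and only conflicts with it inside bb3.
sites = {frozenset(("r116", "r0")): {"bb3"}}
print(split_frees_preference(sites, "r116", "r0", {"bb3"}))

# A made-up, less clear-cut case: the conflict also arises outside the subgraph.
spread = {frozenset(("r116", "r0")): {"bb2", "bb3"}}
print(split_frees_preference(spread, "r116", "r0", {"bb3"}))
```

The first call reports that a split at bb3 would free the preference; the second reports that it would not.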

Either way I don't think this should be a release blocking issue.  Moving to
P2, but keeping the target gcc-6.

2016-03-23 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #11 from Jeffrey A. Law  ---
So given the conflicts during IRA, I can't see a way for IRA to do a better job.
Essentially the key allocno/pseudo wants hard reg 0 to avoid the spill, but
it also conflicts with hard reg 0.

Prior to CSE1 we have the following key statements:

(insn 15 14 16 2 (set (reg/v/f:SI 116 [  ])
(reg:SI 0 r0)) j.c:23 748 {*thumb1_movsi_insn}
 (nil))

[ ... ]
(jump_insn 19 18 20 2 (set (pc)
(if_then_else (ne (reg/v/f:SI 117 [ line ])
(reg/v/f:SI 116 [  ]))
(label_ref:SI 36)
(pc))) j.c:25 756 {cbranchsi4_insn}
 (int_list:REG_BR_PROB 8987 (nil))
 -> 36)

[ ... ]

(insn 21 20 22 3 (set (reg:SI 0 r0)
(reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
 (nil))

[ ... ]
(insn 31 30 36 3 (set (reg/v/f:SI 116 [  ])
(reg:SI 0 r0)) j.c:25 748 {*thumb1_movsi_insn}
 (nil))
[ ... ]

(insn 28 27 29 3 (set (reg:SI 1 r1)
(reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
 (nil))

[ ... ]
(insn 37 39 38 4 (set (reg/i:SI 0 r0)
(reg/v/f:SI 116 [  ])) j.c:26 748 {*thumb1_movsi_insn}
 (nil))
(insn 38 37 0 4 (use (reg/i:SI 0 r0)) j.c:26 -1
 (nil))


Of particular interest is that r116 is not live-in to bb3.  If you do the full
analysis, it can be shown that r116 does not conflict with r0 before cse1.  And
that's key because, to get the code we want, r116 needs to be assigned to r0.

cse (like DOM) has the ability to look at an equality conditional and propagate
equivalences into the true or false arm.  And if we look at insn 19, we've got
an equality conditional between r117 and r116 which will set up an equivalence
between r116 and r117 for bb3.

So in bb3, cse1 will change the initial assignment from:

(insn 21 20 22 3 (set (reg:SI 0 r0)
(reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn}
 (nil))

to:

(insn 21 20 22 3 (set (reg:SI 0 r0)
(reg/v/f:SI 116 [  ])) j.c:25 748 {*thumb1_movsi_insn}
 (nil))


Which makes r116 live-in for bb3.  But note that it doesn't change insn 28
(yet).

forwprop then comes along and changes insn 28 to look like:
(insn 28 27 29 3 (set (reg:SI 1 r1)
(reg/v/f:SI 116 [  ])) j.c:25 748 {*thumb1_movsi_insn}
 (expr_list:REG_DEAD (reg/v/f:SI 116 [  ])
(nil)))

Now r116 is both live-in to bb3 and conflicts with r0 within bb3.

At which point we have lost.
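
The effect can be replayed on a toy model.  The Python sketch below is not GCC code: it walks a simplified bb3 backwards and computes the live-in set and the register conflicts, once with insns 21 and 28 reading r117 and once with both reading r116.  The two `call_*` entries are assumed stand-ins for the elided insns that redefine r0 between insn 21 and insn 31; the remaining entries follow the dump above.

```python
# Toy liveness/conflict model of bb3; not GCC code.
# Each insn is (name, defs, uses, is_copy).  The two "call_*" entries are
# ASSUMED stand-ins for the elided insns that redefine r0.

def analyze(block, live_out):
    """Walk the block backwards; return (live_in, conflicts)."""
    live = set(live_out)
    conflicts = set()
    for _name, defs, uses, is_copy in reversed(block):
        for d in defs:
            for other in live - {d}:
                # A copy's destination does not conflict with its source.
                if is_copy and other in uses:
                    continue
                conflicts.add(frozenset((d, other)))
        live = (live - set(defs)) | set(uses)
    return live, conflicts


def bb3(src):
    """bb3 with insns 21 and 28 reading from register `src`."""
    return [
        ("insn 21", ["r0"], [src], True),
        ("call_a", ["r0"], ["r0"], False),        # assumed
        ("insn 28", ["r1"], [src], True),
        ("call_b", ["r0"], ["r0", "r1"], False),  # assumed
        ("insn 31", ["r116"], ["r0"], True),
    ]


R0_R116 = frozenset(("r0", "r116"))

# r116 is live-out of bb3 because insn 37 in bb4 reads it.
live_before, conf_before = analyze(bb3("r117"), live_out={"r116"})
live_after, conf_after = analyze(bb3("r116"), live_out={"r116"})

print("before:", sorted(live_before), R0_R116 in conf_before)
print("after: ", sorted(live_after), R0_R116 in conf_after)
```

Under these assumptions the model reports r117 as the only live-in register and no r0/r116 conflict before the substitution, and r116 live-in plus an r0/r116 conflict afterwards.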

I've got a couple things I want to poke at...  But nothing that I think has a
high probability of success.

2016-03-23 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #12 from Jeffrey A. Law  ---
Slight correction.  I was looking at the wrong part of the dump when I said
cse1 didn't change insn 28.  It is cse1 that changes insn 28.  So this is
strictly an issue with the transformations cse1 makes.

2016-03-23 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #10 from Vladimir Makarov  ---
(In reply to Jeffrey A. Law from comment #9)
> I think that's a fair characterization.  The extra copy emitted by the older
> compiler gives the allocator more freedom.   With coalescing getting more
> aggressive, the copy is gone and the allocator's freedom is reduced.
> 
> I'll try to have a look at what the allocator is doing, but I doubt it's
> realistically something that can be addressed in this release cycle.

I agree.  It will probably be hard to fix in IRA at this stage.

Coalescing is a controversial thing, which is why there are so many coalescing
algorithms.  I tried a lot of them when I worked on global RA.  In the end, I
found that implicit coalescing worked best.  The word `implicit` means that
we propagate hard register preferences (through copies, including implicit ones
for two-operand constraints) from already assigned pseudos to unassigned ones.
When it is possible to assign the same hard register, we do it and remove the
copies.  Otherwise, we can still assign a hard register, which might be
impossible had we explicitly coalesced the two pseudos.
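
The idea of implicit coalescing can be sketched in a few lines of Python.  This is a toy allocator written for illustration, not IRA: the function `allocate`, its arguments, and the pseudo names are all assumptions.  Each pseudo first tries the hard registers already given to its copy partners and falls back to any free hard register; a copy is removable only if both ends received the same register.

```python
# Toy model of implicit coalescing; not IRA code, all names are assumptions.

def allocate(order, copies, conflicts, hard_regs):
    """Assign hard regs to pseudos in `order`, preferring copy partners' regs.

    copies:    list of (x, y) pairs connected by a copy (x or y may be a hard reg)
    conflicts: dict mapping a pseudo to the pseudos/hard regs it conflicts with
    """
    assigned = {hr: hr for hr in hard_regs}  # hard regs map to themselves
    for p in order:
        blocked = {assigned[c] for c in conflicts.get(p, ()) if c in assigned}
        partners = [y if x == p else x for x, y in copies if p in (x, y)]
        prefs = [assigned[q] for q in partners if q in assigned]
        choice = next((r for r in prefs + list(hard_regs) if r not in blocked), None)
        if choice is not None:
            assigned[p] = choice  # otherwise p stays unassigned (spill)
    removable = [(x, y) for x, y in copies
                 if x in assigned and y in assigned and assigned[x] == assigned[y]]
    return {p: assigned.get(p) for p in order}, removable


# Preference honoured: both copies can be removed.
easy, easy_removed = allocate(["a", "b"], [("a", "r0"), ("a", "b")], {}, ["r0", "r1"])
print(easy, easy_removed)

# Preference not honoured: a and b still get registers, the copy stays.
# An explicitly coalesced a+b would conflict with both x (r0) and y (r1).
conf = {"x": {"y", "a"}, "y": {"x", "b"}, "a": {"x"}, "b": {"y"}}
hard, hard_removed = allocate(["x", "y", "a", "b"], [("a", "b")], conf, ["r0", "r1"])
print(hard, hard_removed)
```

In the second case a merged a+b node would have no free register with only r0 and r1 available, whereas the implicit scheme gives a and b different registers and keeps the copy.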

Only LRA does explicit coalescing, and only for pseudos assigned to memory,
since we have no constraints on the number of stack slots and memory-memory
moves are expensive and require an additional hard reg.

I guess this sort of PR could probably be fixed if we had live-range splitting
at any place, not only on loop borders.  But it might create other PRs if it
makes wrong decisions :)  Unfortunately, it is all about heuristics.  They
can work successfully in one case and do bad things in another.  The
performance of credible benchmarks should be the criterion.

2016-03-23 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #9 from Jeffrey A. Law  ---
I think that's a fair characterization.  The extra copy emitted by the older
compiler gives the allocator more freedom.   With coalescing getting more
aggressive, the copy is gone and the allocator's freedom is reduced.

I'll try to have a look at what the allocator is doing, but I doubt it's
realistically something that can be addressed in this release cycle.

2016-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization, ra
           Priority|P3                          |P1
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #8 from Richard Biener  ---
So it's an RA issue then, which previously was mitigated by the extra copy (and
thus a split live range).

2016-03-10 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Jeffrey A. Law  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-03-10
                 CC|                            |law at redhat dot com
     Ever confirmed|0                           |1

--- Comment #7 from Jeffrey A. Law  ---
AFAICT the coalescing code is working as expected here.  Working with
r226900...

So, the only real statement of interest is:

  # iftmp.0_1 = PHI 


Which results in the following partition map:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (iftmp.0_18 - 18 )

And the following coalesce list:

Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 2] : Success -> 0

Note that new_line_9 isn't ever processed.  Which is a bit odd to say the
least.  Moving to r226901 we have:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (new_line_9 - 9 )
Partition 3 (iftmp.0_18 - 18 )

Coalesce list: (1)iftmp.0_1 & (9)new_line_9 [map: 0, 2] : Success -> 0
Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 3] : Success -> 0

Note that we coalesced new_line_9 into the same partition as iftmp.0_{1,18}.

That seems valid given their use in the PHI and my quick review of the
conflicts.  

So looking at the actual expansion, r226900 will emit a copy from new_line_9
into iftmp.0_{1,18}.  That's a result of r226900 not coalescing the objects.

So AFAICT this isn't a coalescing issue, at least not at the gimple->rtl
expansion point.
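
The two partition maps can be reproduced with a small model.  The Python sketch below is not the out-of-SSA coalescer: `coalesce` is a hypothetical helper that starts with one partition per name and merges each pair on the coalesce list unless members of the two partitions conflict.  The conflict set is assumed empty here, in line with the quick review above.

```python
# Toy partition model; not GCC's coalescing code.  Conflicts are assumed empty.

def coalesce(names, pairs, conflicts=frozenset()):
    """Return (partition map, log) after trying to merge each pair in order."""
    part = {name: i for i, name in enumerate(names)}
    log = []
    for a, b in pairs:
        pa, pb = part[a], part[b]
        if pa == pb:
            log.append((a, b, "Already coalesced", pa))
            continue
        side_a = [n for n in names if part[n] == pa]
        side_b = [n for n in names if part[n] == pb]
        if any(frozenset((x, y)) in conflicts for x in side_a for y in side_b):
            log.append((a, b, "Fail", None))
            continue
        keep = min(pa, pb)  # merged partition keeps the lower index
        for n in side_a + side_b:
            part[n] = keep
        log.append((a, b, "Success", keep))
    return part, log


# r226900: new_line_9 never appears in the partition map.
old_part, old_log = coalesce(
    ["iftmp.0_1", "line_7(D)", "iftmp.0_18"],
    [("iftmp.0_1", "iftmp.0_18")])

# r226901: new_line_9 is a partition of its own and gets merged as well.
new_part, new_log = coalesce(
    ["iftmp.0_1", "line_7(D)", "new_line_9", "iftmp.0_18"],
    [("iftmp.0_1", "new_line_9"), ("iftmp.0_1", "iftmp.0_18")])

for entry in old_log + new_log:
    print(entry)
```

With new_line_9 in the same partition as iftmp.0_{1,18}, expansion has no copy to emit between them, which matches the difference between the two revisions described above.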

2016-03-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
   Target Milestone|---                         |6.0
            Summary|Code/performance regression |[6 Regression]
                   |due to poor register        |Code/performance regression
                   |allocation on Cortex-M0     |due to poor register
                   |                            |allocation on Cortex-M0

--- Comment #6 from Richard Biener  ---
So probably another coalescing "fallout"