[Bug tree-optimization/60172] ARM performance regression from trunk@207239

rguenther at suse dot de Thu, 20 Feb 2014 02:02:26 -0800

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172


--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 19 Feb 2014, steven at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60172
> 
> Steven Bosscher <steven at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |steven at gcc dot gnu.org
> 
> --- Comment #12 from Steven Bosscher <steven at gcc dot gnu.org> ---
> (In reply to Joey Ye from comment #11)
> 
> Sometimes it helps to use -fdump-rtl-slim. Matter of taste but I find
> that much easier to interpret than LISP-like RTL dumps.
> 
> Annotated "good expansion":
> ;; _41 = _42 * 4;
> 20: r126=r131<<2
> 
> ;; _40 = _2 + _41;
> 21: r136=r130+r119  // r136=Arr_2_Par_Ref+r119
> 22: r125=r136+r126  // r125=Arr_2_Par_Ref+r119+r131<<2
> 
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 29: r139=r130+r119  // r139=Arr_2_Par_Ref+r119
> 30: r140=r139+r126  // r140=Arr_2_Par_Ref+r119+r131<<2 (==r125)
> 31: r141=r140+1000  // r141=Arr_2_Par_Ref+r119+r131<<2+1000 (==r125+1000)
> 32: [r141+20]=r124
> 
> In this case, the RHS for the SETs of r140 and r125 are lexically
> identical for value numbering, so the job for CSE is easy.
> 
> 
> Annotated "bad expansion":
> ;; _40 = Arr_2_Par_Ref_22(D) + _12;
> 22: r138=r128+r121        
> 23: r127=r132+r138  // r127=Arr_2_Par_Ref+r128+r121
> 
> ;; _32 = _20 + 1000;
> 29: r124=r121+1000
> 
> ;; MEM[(int[25] *)_51 + 20B] = _34;
> 32: r141=r132+r124  // r141=Arr_2_Par_Ref+r121+1000
> 33: r142=r141+r128  // r142=Arr_2_Par_Ref+r128+r121+1000 (==r127+1000)

(==r138+1000)

> 34: [r142+20]=r126
> 
> Here, the "+1000" confuses CSE. The sets of r127 and r142 have a common
> sub-expression as value, but none of the sub-expressions are lexically 
> identical.  RTL CSE has limited ability to look through sub-expressions
> to identify "same value" sub-expressions (anchors, base regs, etc.) but
> apparently this case is too complex for it to handle.

So expansion generates "better" code (a single insn covering the
two adds), caused by expanding a chain of two regular PLUS_EXPR
rather than a chain of two POINTER_PLUS_EXPRs.

That's of course unfortunate - but I can't see how this is not
a missed optimization in CSE ...

On the GIMPLE level before expansion we have

 _40 = Arr_2_Par_Ref_22(D) + (_41 + pretmp_20);

 _51 = Arr_2_Par_Ref_22(D) + (_41 + (pretmp_20 + 1000));

thus a similar issue - missed CSE due to bad association (and to
not having a CSE pass after forwprop exposed the opportunity).
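
To make the association problem concrete, here is a toy sketch in Python.
It is not GCC code: the nested-tuple expression form and the helpers
add(), flatten() and subexpressions() are made up for illustration. It
only shows that a purely lexical comparison finds nothing in common
between the two sums above, while flattening them into a set of terms
plus a constant shows they differ by exactly 1000.

```python
# Toy model only: an expression is a nested tuple ("+", lhs, rhs); leaves are
# names (str) or integer constants. These are not GCC's data structures.

def add(lhs, rhs):
    return ("+", lhs, rhs)

def flatten(expr):
    """Return (sorted tuple of symbolic terms, summed constant) for a sum tree."""
    if isinstance(expr, int):
        return (), expr
    if isinstance(expr, str):
        return (expr,), 0
    _, lhs, rhs = expr
    lhs_terms, lhs_const = flatten(lhs)
    rhs_terms, rhs_const = flatten(rhs)
    return tuple(sorted(lhs_terms + rhs_terms)), lhs_const + rhs_const

def subexpressions(expr):
    """Yield every '+' node in the tree, including the root."""
    if isinstance(expr, tuple):
        yield expr
        yield from subexpressions(expr[1])
        yield from subexpressions(expr[2])

# The right-hand sides of the two GIMPLE statements, associated as written.
e40 = add("Arr_2_Par_Ref", add("_41", "pretmp_20"))
e51 = add("Arr_2_Par_Ref", add("_41", add("pretmp_20", 1000)))

# Lexical matching: collect '+' nodes that are structurally equal in both trees.
lexical_hits = set(subexpressions(e40)) & set(subexpressions(e51))

# Canonical form: same symbolic terms, only the constant part differs.
terms40, const40 = flatten(e40)
terms51, const51 = flatten(e51)

print("lexically shared subexpressions:", lexical_hits)
print("same symbolic terms:", terms40 == terms51)
print("constant difference:", const51 - const40)
```

With the association used in the dump, lexical_hits is empty even though
the flattened forms agree on every symbolic term, which is the kind of
opportunity a reassociation step run before CSE could expose.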

Unfortunately we expose the opportunity by late complete unrolling
only because early unrolling says

size: 7-2, last_iteration: 3-0
  Loop size: 7
  Estimated size after unrolling: 8
Not unrolling loop 1: size would grow.

and you can't make it unroll that loop (outer loops are only ever
unrolled early if doing so doesn't increase code-size).

Now the pass order is late unroll - reassoc - DOM - forwprop, which is
exactly the wrong way around to eventually catch the CSE opportunity
at the GIMPLE level; it would need to be late unroll - forwprop -
reassoc - DOM.

Richard.
