[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-03-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #11 from Richard Biener  ---
Author: rguenth
Date: Tue Mar 27 13:23:15 2018
New Revision: 258881

URL: https://gcc.gnu.org/viewcvs?rev=258881&root=gcc&view=rev
Log:
2018-03-27  Richard Biener  

PR middle-end/84067
* match.pd ((A * C) +- (B * C) -> (A+-B) * C): Guard with
explicit single_use checks.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/match.pd
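
For readers unfamiliar with the pattern named in the ChangeLog entry, here is a
minimal sketch in plain C of the rewrite and of why a single-use guard can
matter. The function names are made up for illustration; this is ordinary
arithmetic, not match.pd syntax and not the committed change itself.

```c
#include <assert.h>
#include <stdint.h>

/* Unfactored form: two multiplies and one add. */
static uint64_t unfactored(uint64_t a, uint64_t b, uint64_t c)
{
    return a * c + b * c;
}

/* Factored form: one add and one multiply, same value (modulo 2^64). */
static uint64_t factored(uint64_t a, uint64_t b, uint64_t c)
{
    return (a + b) * c;
}

/* Hypothetical multi-use case: a * c is also needed elsewhere, so it must
   be computed anyway.  Factoring the sum then still leaves two multiplies
   (a * c and (a + b) * c), so nothing is saved.  */
static uint64_t multi_use(uint64_t a, uint64_t b, uint64_t c, uint64_t *other)
{
    uint64_t ac = a * c;   /* kept alive by the second use below */
    *other = ac + 400;     /* second use of a * c */
    return ac + b * c;     /* leaving this unfactored costs no extra multiply */
}
```

Both forms compute the same value; the guard only decides when the rewrite is
worth applying.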

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-03-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Richard Biener  ---
Fixed.

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-03-27 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1
 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener  ---
OK, so I'll stick some single_use markers on the new patterns.

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-29 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #8 from ktkachov at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > 
> > --- Comment #6 from ktkachov at gcc dot gnu.org ---
> > (In reply to rguent...@suse.de from comment #5)
> > > On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > > > 
> > > > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > > > (In reply to Richard Biener from comment #2)
> > > > > So any hint on whether the code after r257077 is better or worse than 
> > > > > before?
> > > > 
> > > > Looks worse unfortunately:
> > > > For aarch64 at -O2 it generates:
> > > > foo:
> > > > mov w3, 44
> > > > mov w2, 40
> > > > mov w5, 1
> > > > mov w4, 2
> > > > smull   x3, w1, w3
> > > > smull   x2, w1, w2
> > > > str w5, [x0, x3]
> > > > add x2, x2, 400
> > > > add x1, x2, x1, sxtw 2
> > > > str w4, [x0, x1]
> > > > ret
> > > > 
> > > > whereas with r257077 it generates the shorter:
> > > > foo:
> > > > mov w3, 40
> > > > sxtw    x2, w1
> > > > mov w4, 1
> > > > smaddl  x0, w1, w3, x0
> > > > mov w3, 2
> > > > add x1, x0, x2, lsl 2
> > > > str w4, [x0, x2, lsl 2]
> > > > str w3, [x1, 400]
> > > > ret
> > > 
> > > So shorter is worse?  Might be because I don't understand the
> > > difference between the 'lsl 2' and the 'sxtw 2' or the cost
> > > of the [x1, 400] addressing.
> > 
> > Sorry, I messed up the writeup. Let me try again.
> > The shorter sequence (with the smaddl) is the good one and is produced
> > *without* r257077. After r257077 we generate the longer and worse sequence
> > with two smull.
> 
> I see the shorter sequence with TOT, r257077 included.  The testcase
> explicitly checks for no widen-mult-plus but we now have two:
> 
>   <bb 2> [local count: 1073741825]:
>   _17 = Idx_6(D) w* 44;
>   _13 = Arr_7(D) + _17;
>   MEM[(int[10] *)_13] = 1;
>   _4 = WIDEN_MULT_PLUS_EXPR ;
>   _18 = WIDEN_MULT_PLUS_EXPR ;
>   _16 = Arr_7(D) + _18;
>   MEM[(int[10] *)_16] = 2;
>   return;
> 
> note the "shorter" sequence I see is
> 
> foo:
> mov x4, 400
> mov w3, 40
> mov w2, 44
> mov w5, 1
> smaddl  x3, w1, w3, x4
> mov w4, 2
> smull   x2, w1, w2
> add x1, x3, x1, sxtw 2
> str w5, [x0, x2]
> str w4, [x0, x1]
> ret
> 
> which doesn't 1:1 match either of yours.

Hmm, the exact instruction mix will depend a lot on the cpu tuning in question
because the RTX costs affect the widening multiplication expansion, but at the
tree level I see only one WIDEN_MULT_PLUS_EXPR with current ToT (with r257077):

  <bb 2> [local count: 1073741825]:
  _1 = (long unsigned int) Idx_6(D);
  _2 = Idx_6(D) w* 40;
  _3 = Arr_7(D) + _2;
  _12 = Idx_6(D) w* 4;
  _11 = Idx_6(D) w* 44;
  _13 = Arr_7(D) + _11;
  MEM[(int[10] *)_13] = 1;
  _4 = _2 + 400;
  _5 = Arr_7(D) + _4;
  _14 = WIDEN_MULT_PLUS_EXPR ;
  _16 = Arr_7(D) + _14;
  MEM[(int[10] *)_16] = 2;
  return;
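
The byte offsets in these dumps (Idx * 44 for the first store; Idx * 40 + 400
plus Idx * 4 for the second) are consistent with a function of roughly the
following shape, working on rows of ten ints. This is a reconstruction from the
dump for illustration only and may not match gcc.dg/wmul-1.c verbatim; the
typedef name ArrT is an assumption, and the offset comments assume a 4-byte
int.

```c
#include <assert.h>

/* Assumed type: rows of 10 ints, matching the int[10] seen in the dump. */
typedef int ArrT[10][10];

void foo(ArrT Arr, int Idx)
{
    Arr[Idx][Idx] = 1;        /* byte offset Idx * 40 + Idx * 4 = Idx * 44 */
    Arr[Idx + 10][Idx] = 2;   /* byte offset Idx * 40 + 400 + Idx * 4 */
}
```

The product Idx * 40 feeds both addresses, which is the multi-use situation
discussed in this bug.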

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-29 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #7 from rguenther at suse dot de  ---
On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> 
> --- Comment #6 from ktkachov at gcc dot gnu.org ---
> (In reply to rguent...@suse.de from comment #5)
> > On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > > 
> > > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > > (In reply to Richard Biener from comment #2)
> > > > So any hint on whether the code after r257077 is better or worse than 
> > > > before?
> > > 
> > > Looks worse unfortunately:
> > > For aarch64 at -O2 it generates:
> > > foo:
> > > mov w3, 44
> > > mov w2, 40
> > > mov w5, 1
> > > mov w4, 2
> > > smull   x3, w1, w3
> > > smull   x2, w1, w2
> > > str w5, [x0, x3]
> > > add x2, x2, 400
> > > add x1, x2, x1, sxtw 2
> > > str w4, [x0, x1]
> > > ret
> > > 
> > > whereas with r257077 it generates the shorter:
> > > foo:
> > > mov w3, 40
> > > sxtw    x2, w1
> > > mov w4, 1
> > > smaddl  x0, w1, w3, x0
> > > mov w3, 2
> > > add x1, x0, x2, lsl 2
> > > str w4, [x0, x2, lsl 2]
> > > str w3, [x1, 400]
> > > ret
> > 
> > So shorter is worse?  Might be because I don't understand the
> > difference between the 'lsl 2' and the 'sxtw 2' or the cost
> > of the [x1, 400] addressing.
> 
> Sorry, I messed up the writeup. Let me try again.
> The shorter sequence (with the smaddl) is the good one and is produced
> *without* r257077. After r257077 we generate the longer and worse sequence
> with two smull.

I see the shorter sequence with TOT, r257077 included.  The testcase
explicitly checks for no widen-mult-plus but we now have two:

  <bb 2> [local count: 1073741825]:
  _17 = Idx_6(D) w* 44;
  _13 = Arr_7(D) + _17;
  MEM[(int[10] *)_13] = 1;
  _4 = WIDEN_MULT_PLUS_EXPR ;
  _18 = WIDEN_MULT_PLUS_EXPR ;
  _16 = Arr_7(D) + _18;
  MEM[(int[10] *)_16] = 2;
  return;

note the "shorter" sequence I see is

foo:
mov x4, 400
mov w3, 40
mov w2, 44
mov w5, 1
smaddl  x3, w1, w3, x4
mov w4, 2
smull   x2, w1, w2
add x1, x3, x1, sxtw 2
str w5, [x0, x2]
str w4, [x0, x1]
ret

which doesn't 1:1 match either of yours.

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-29 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #6 from ktkachov at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #5)
> On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > 
> > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > (In reply to Richard Biener from comment #2)
> > > So any hint on whether the code after r257077 is better or worse than 
> > > before?
> > 
> > Looks worse unfortunately:
> > For aarch64 at -O2 it generates:
> > foo:
> > mov w3, 44
> > mov w2, 40
> > mov w5, 1
> > mov w4, 2
> > smull   x3, w1, w3
> > smull   x2, w1, w2
> > str w5, [x0, x3]
> > add x2, x2, 400
> > add x1, x2, x1, sxtw 2
> > str w4, [x0, x1]
> > ret
> > 
> > whereas with r257077 it generates the shorter:
> > foo:
> > mov w3, 40
> > sxtw    x2, w1
> > mov w4, 1
> > smaddl  x0, w1, w3, x0
> > mov w3, 2
> > add x1, x0, x2, lsl 2
> > str w4, [x0, x2, lsl 2]
> > str w3, [x1, 400]
> > ret
> 
> So shorter is worse?  Might be because I don't understand the
> difference between the 'lsl 2' and the 'sxtw 2' or the cost
> of the [x1, 400] addressing.

Sorry, I messed up the writeup. Let me try again.
The shorter sequence (with the smaddl) is the good one and is produced
*without* r257077. After r257077 we generate the longer and worse sequence with
two smull.
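
As a sanity check that the two quoted sequences store to the same addresses, the
sketch below models each one in C, with the Arr base (x0 on entry) taken as 0 so
that only byte offsets remain. The struct and function names are made up for
illustration, and this models the arithmetic only, not instruction cost.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets, relative to x0 on entry, of the two stores. */
struct offsets { int64_t first; int64_t second; };

/* Longer sequence: two smull. */
static struct offsets two_smull(int32_t w1)
{
    struct offsets o;
    int64_t x3 = (int64_t)w1 * 44;     /* smull x3, w1, w3   (w3 = 44) */
    int64_t x2 = (int64_t)w1 * 40;     /* smull x2, w1, w2   (w2 = 40) */
    o.first = x3;                      /* str w5, [x0, x3] */
    x2 = x2 + 400;                     /* add x2, x2, 400 */
    o.second = x2 + (int64_t)w1 * 4;   /* add x1, x2, x1, sxtw 2 */
    return o;                          /* str w4, [x0, x1] */
}

/* Shorter sequence: one smaddl. */
static struct offsets one_smaddl(int32_t w1)
{
    struct offsets o;
    int64_t x2 = (int64_t)w1;          /* sxtw x2, w1 */
    int64_t x0 = (int64_t)w1 * 40;     /* smaddl x0, w1, w3, x0 (base = 0) */
    int64_t x1 = x0 + x2 * 4;          /* add x1, x0, x2, lsl 2 */
    o.first = x0 + x2 * 4;             /* str w4, [x0, x2, lsl 2] */
    o.second = x1 + 400;               /* str w3, [x1, 400] */
    return o;
}
```

Multiplication by 4 stands in for the shift by 2 so that negative indices stay
well defined in C.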

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-29 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #5 from rguenther at suse dot de  ---
On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> 
> --- Comment #3 from ktkachov at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #2)
> > So any hint on whether the code after r257077 is better or worse than 
> > before?
> 
> Looks worse unfortunately:
> For aarch64 at -O2 it generates:
> foo:
> mov w3, 44
> mov w2, 40
> mov w5, 1
> mov w4, 2
> smull   x3, w1, w3
> smull   x2, w1, w2
> str w5, [x0, x3]
> add x2, x2, 400
> add x1, x2, x1, sxtw 2
> str w4, [x0, x1]
> ret
> 
> whereas with r257077 it generates the shorter:
> foo:
> mov w3, 40
> sxtw    x2, w1
> mov w4, 1
> smaddl  x0, w1, w3, x0
> mov w3, 2
> add x1, x0, x2, lsl 2
> str w4, [x0, x2, lsl 2]
> str w3, [x1, 400]
> ret

So shorter is worse?  Might be because I don't understand the
difference between the 'lsl 2' and the 'sxtw 2' or the cost
of the [x1, 400] addressing.
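
On the 'lsl 2' versus 'sxtw 2' question: in the AArch64 add forms quoted above,
'lsl 2' shifts the full 64-bit register left by two, while 'sxtw 2' first
sign-extends the low 32 bits of the register to 64 bits and then shifts left by
two. A minimal C model, with made-up function names and no attempt to model
cost:

```c
#include <assert.h>
#include <stdint.h>

/* add xd, xn, xm, lsl 2: uses all 64 bits of xm. */
static uint64_t add_lsl2(uint64_t xn, uint64_t xm)
{
    return xn + (xm << 2);
}

/* add xd, xn, xm, sxtw 2: uses only the low 32 bits of xm, sign-extended.
   The uint32_t -> int32_t conversion is implementation-defined in C; the
   usual two's-complement wraparound is assumed here.  */
static uint64_t add_sxtw2(uint64_t xn, uint64_t xm)
{
    int64_t ext = (int32_t)(uint32_t)xm;   /* sign-extend low word */
    return xn + ((uint64_t)ext << 2);
}
```

The two agree when the register already holds a sign-extended 32-bit value,
which is what the separate sxtw in the shorter sequence guarantees.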

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-29 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #4 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #3)
> (In reply to Richard Biener from comment #2)
> > So any hint on whether the code after r257077 is better or worse than 
> > before?
> 
> Looks worse unfortunately:
> For aarch64 at -O2 it generates:
> foo:
>   mov w3, 44
>   mov w2, 40
>   mov w5, 1
>   mov w4, 2
>   smull   x3, w1, w3
>   smull   x2, w1, w2
>   str w5, [x0, x3]
>   add x2, x2, 400
>   add x1, x2, x1, sxtw 2
>   str w4, [x0, x1]
>   ret
> 
> whereas with r257077 it generates the shorter:

Sorry, I meant to write "with r257077 reverted..."

> foo:
>   mov w3, 40
>   sxtw    x2, w1
>   mov w4, 1
>   smaddl  x0, w1, w3, x0
>   mov w3, 2
>   add x1, x0, x2, lsl 2
>   str w4, [x0, x2, lsl 2]
>   str w3, [x1, 400]
>   ret

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-29 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #3 from ktkachov at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> So any hint on whether the code after r257077 is better or worse than before?

Looks worse unfortunately:
For aarch64 at -O2 it generates:
foo:
mov w3, 44
mov w2, 40
mov w5, 1
mov w4, 2
smull   x3, w1, w3
smull   x2, w1, w2
str w5, [x0, x3]
add x2, x2, 400
add x1, x2, x1, sxtw 2
str w4, [x0, x1]
ret

whereas with r257077 it generates the shorter:
foo:
mov w3, 40
sxtw    x2, w1
mov w4, 1
smaddl  x0, w1, w3, x0
mov w3, 2
add x1, x0, x2, lsl 2
str w4, [x0, x2, lsl 2]
str w3, [x1, 400]
ret

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
So any hint on whether the code after r257077 is better or worse than before?

[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

2018-01-26 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
  Known to work||7.2.1
   Keywords||missed-optimization
   Last reconfirmed||2018-01-26
 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1
   Target Milestone|--- |8.0
  Known to fail||8.0

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed.