[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #11 from Richard Biener ---
Author: rguenth
Date: Tue Mar 27 13:23:15 2018
New Revision: 258881

URL: https://gcc.gnu.org/viewcvs?rev=258881&root=gcc&view=rev
Log:
2018-03-27  Richard Biener

        PR middle-end/84067
        * match.pd ((A * C) +- (B * C) -> (A+-B) * C): Guard with
        explicit single_use checks.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
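As an illustration of the rewrite named in the ChangeLog entry, the following C sketch shows (A * C) + (B * C) next to (A + B) * C on unsigned 32-bit values, where the two agree under wrap-around arithmetic. The function names are hypothetical and this is not the match.pd source. The third function sketches why a single-use guard can matter: when a product is also needed elsewhere, factoring cannot remove it.

```c
#include <stdint.h>

/* Unfactored form: (a * c) + (b * c), two multiplications. */
uint32_t sum_of_products(uint32_t a, uint32_t b, uint32_t c)
{
    return a * c + b * c;
}

/* Factored form: (a + b) * c, one multiplication.
   Equal to sum_of_products modulo 2^32. */
uint32_t factored(uint32_t a, uint32_t b, uint32_t c)
{
    return (a + b) * c;
}

/* Multi-use case: a * c is also stored through *first, so it has to be
   computed regardless.  Rewriting the return value as (a + b) * c would
   keep that multiply and add another one instead of saving work. */
uint32_t multi_use(uint32_t a, uint32_t b, uint32_t c, uint32_t *first)
{
    uint32_t ac = a * c;
    *first = ac;
    return ac + b * c;
}
```

For example, sum_of_products(3, 4, 5) and factored(3, 4, 5) both yield 35.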
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Richard Biener ---
Fixed.
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener ---
OK, so I'll stick some single_use markers on the new patterns.
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #8 from ktkachov at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> >
> > --- Comment #6 from ktkachov at gcc dot gnu.org ---
> > (In reply to rguent...@suse.de from comment #5)
> > > On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > > >
> > > > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > > > (In reply to Richard Biener from comment #2)
> > > > > So any hint on whether the code after r257077 is better or worse than
> > > > > before?
> > > >
> > > > Looks worse unfortunately:
> > > > For aarch64 at -O2 it generates:
> > > > foo:
> > > >         mov     w3, 44
> > > >         mov     w2, 40
> > > >         mov     w5, 1
> > > >         mov     w4, 2
> > > >         smull   x3, w1, w3
> > > >         smull   x2, w1, w2
> > > >         str     w5, [x0, x3]
> > > >         add     x2, x2, 400
> > > >         add     x1, x2, x1, sxtw 2
> > > >         str     w4, [x0, x1]
> > > >         ret
> > > >
> > > > whereas with r257077 it generates the shorter:
> > > > foo:
> > > >         mov     w3, 40
> > > >         sxtw    x2, w1
> > > >         mov     w4, 1
> > > >         smaddl  x0, w1, w3, x0
> > > >         mov     w3, 2
> > > >         add     x1, x0, x2, lsl 2
> > > >         str     w4, [x0, x2, lsl 2]
> > > >         str     w3, [x1, 400]
> > > >         ret
> > >
> > > So shorter is worse?  Might be because I don't understand the
> > > difference between the 'lsl 2' and the 'sxtw 2' or the cost
> > > of the [x1, 400] addressing.
> >
> > Sorry, I messed up the writeup. Let me try again.
> > The shorter sequence (with the smaddl) is the good one and is produced
> > *without* r257077. After r257077 we generate the longer and worse sequence
> > with two smull.
>
> I see the shorter sequence with TOT, r257077 included.  The testcase
> explicitly checks for no widen-mult-plus but we now have two:
>
>   [local count: 1073741825]:
>   _17 = Idx_6(D) w* 44;
>   _13 = Arr_7(D) + _17;
>   MEM[(int[10] *)_13] = 1;
>   _4 = WIDEN_MULT_PLUS_EXPR;
>   _18 = WIDEN_MULT_PLUS_EXPR ;
>   _16 = Arr_7(D) + _18;
>   MEM[(int[10] *)_16] = 2;
>   return;
>
> note the "shorter" sequence I see is
>
> foo:
>         mov     x4, 400
>         mov     w3, 40
>         mov     w2, 44
>         mov     w5, 1
>         smaddl  x3, w1, w3, x4
>         mov     w4, 2
>         smull   x2, w1, w2
>         add     x1, x3, x1, sxtw 2
>         str     w5, [x0, x2]
>         str     w4, [x0, x1]
>         ret
>
> which doesn't 1:1 match either of yours.

Hmm, the exact instruction mix will depend a lot on the CPU tuning in question
because the RTX costs affect the widening multiplication expansion, but at the
tree level I see only one WIDEN_MULT_PLUS_EXPR with current ToT (with r257077):

  [local count: 1073741825]:
  _1 = (long unsigned int) Idx_6(D);
  _2 = Idx_6(D) w* 40;
  _3 = Arr_7(D) + _2;
  _12 = Idx_6(D) w* 4;
  _11 = Idx_6(D) w* 44;
  _13 = Arr_7(D) + _11;
  MEM[(int[10] *)_13] = 1;
  _4 = _2 + 400;
  _5 = Arr_7(D) + _4;
  _14 = WIDEN_MULT_PLUS_EXPR ;
  _16 = Arr_7(D) + _14;
  MEM[(int[10] *)_16] = 2;
  return;
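The constants in the tree dump in this comment (40, 4, 44 and 400) are consistent with two stores into an array of 10-element int rows, one at [Idx][Idx] and one at [Idx + 10][Idx]. The sketch below assumes that shape and a 4-byte int; it is a reconstruction from the dump rather than the text of gcc.dg/wmul-1.c, and all helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: rows of 10 ints, 4 bytes each (not taken from the testcase). */
enum { ELEM_BYTES = 4, ROW_BYTES = 10 * ELEM_BYTES };

/* Byte offset of element [row][col], with each index widened to 64 bits
   before multiplying, as a widening multiply would do. */
int64_t element_offset(int32_t row, int32_t col)
{
    return (int64_t)row * ROW_BYTES + (int64_t)col * ELEM_BYTES;
}

/* First store, [idx][idx]: idx * 40 + idx * 4 folds to idx w* 44. */
int64_t first_store_offset(int32_t idx)
{
    return (int64_t)idx * 44;
}

/* Second store, [idx + 10][idx]: (idx w* 40) + 400 + (idx w* 4). */
int64_t second_store_offset(int32_t idx)
{
    return (int64_t)idx * 40 + 400 + (int64_t)idx * 4;
}
```

Under these assumptions the two stores are always 400 bytes apart, which is the constant that shows up in both the dump and the assembly.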
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #7 from rguenther at suse dot de ---
On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
>
> --- Comment #6 from ktkachov at gcc dot gnu.org ---
> (In reply to rguent...@suse.de from comment #5)
> > On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > >
> > > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > > (In reply to Richard Biener from comment #2)
> > > > So any hint on whether the code after r257077 is better or worse than
> > > > before?
> > >
> > > Looks worse unfortunately:
> > > For aarch64 at -O2 it generates:
> > > foo:
> > >         mov     w3, 44
> > >         mov     w2, 40
> > >         mov     w5, 1
> > >         mov     w4, 2
> > >         smull   x3, w1, w3
> > >         smull   x2, w1, w2
> > >         str     w5, [x0, x3]
> > >         add     x2, x2, 400
> > >         add     x1, x2, x1, sxtw 2
> > >         str     w4, [x0, x1]
> > >         ret
> > >
> > > whereas with r257077 it generates the shorter:
> > > foo:
> > >         mov     w3, 40
> > >         sxtw    x2, w1
> > >         mov     w4, 1
> > >         smaddl  x0, w1, w3, x0
> > >         mov     w3, 2
> > >         add     x1, x0, x2, lsl 2
> > >         str     w4, [x0, x2, lsl 2]
> > >         str     w3, [x1, 400]
> > >         ret
> >
> > So shorter is worse?  Might be because I don't understand the
> > difference between the 'lsl 2' and the 'sxtw 2' or the cost
> > of the [x1, 400] addressing.
>
> Sorry, I messed up the writeup. Let me try again.
> The shorter sequence (with the smaddl) is the good one and is produced
> *without* r257077. After r257077 we generate the longer and worse sequence
> with two smull.

I see the shorter sequence with TOT, r257077 included.  The testcase
explicitly checks for no widen-mult-plus but we now have two:

  [local count: 1073741825]:
  _17 = Idx_6(D) w* 44;
  _13 = Arr_7(D) + _17;
  MEM[(int[10] *)_13] = 1;
  _4 = WIDEN_MULT_PLUS_EXPR;
  _18 = WIDEN_MULT_PLUS_EXPR ;
  _16 = Arr_7(D) + _18;
  MEM[(int[10] *)_16] = 2;
  return;

note the "shorter" sequence I see is

foo:
        mov     x4, 400
        mov     w3, 40
        mov     w2, 44
        mov     w5, 1
        smaddl  x3, w1, w3, x4
        mov     w4, 2
        smull   x2, w1, w2
        add     x1, x3, x1, sxtw 2
        str     w5, [x0, x2]
        str     w4, [x0, x1]
        ret

which doesn't 1:1 match either of yours.
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #6 from ktkachov at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #5)
> On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> >
> > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > (In reply to Richard Biener from comment #2)
> > > So any hint on whether the code after r257077 is better or worse than
> > > before?
> >
> > Looks worse unfortunately:
> > For aarch64 at -O2 it generates:
> > foo:
> >         mov     w3, 44
> >         mov     w2, 40
> >         mov     w5, 1
> >         mov     w4, 2
> >         smull   x3, w1, w3
> >         smull   x2, w1, w2
> >         str     w5, [x0, x3]
> >         add     x2, x2, 400
> >         add     x1, x2, x1, sxtw 2
> >         str     w4, [x0, x1]
> >         ret
> >
> > whereas with r257077 it generates the shorter:
> > foo:
> >         mov     w3, 40
> >         sxtw    x2, w1
> >         mov     w4, 1
> >         smaddl  x0, w1, w3, x0
> >         mov     w3, 2
> >         add     x1, x0, x2, lsl 2
> >         str     w4, [x0, x2, lsl 2]
> >         str     w3, [x1, 400]
> >         ret
>
> So shorter is worse?  Might be because I don't understand the
> difference between the 'lsl 2' and the 'sxtw 2' or the cost
> of the [x1, 400] addressing.

Sorry, I messed up the writeup. Let me try again.
The shorter sequence (with the smaddl) is the good one and is produced
*without* r257077. After r257077 we generate the longer and worse sequence with
two smull.
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #5 from rguenther at suse dot de ---
On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
>
> --- Comment #3 from ktkachov at gcc dot gnu.org ---
> (In reply to Richard Biener from comment #2)
> > So any hint on whether the code after r257077 is better or worse than
> > before?
>
> Looks worse unfortunately:
> For aarch64 at -O2 it generates:
> foo:
>         mov     w3, 44
>         mov     w2, 40
>         mov     w5, 1
>         mov     w4, 2
>         smull   x3, w1, w3
>         smull   x2, w1, w2
>         str     w5, [x0, x3]
>         add     x2, x2, 400
>         add     x1, x2, x1, sxtw 2
>         str     w4, [x0, x1]
>         ret
>
> whereas with r257077 it generates the shorter:
> foo:
>         mov     w3, 40
>         sxtw    x2, w1
>         mov     w4, 1
>         smaddl  x0, w1, w3, x0
>         mov     w3, 2
>         add     x1, x0, x2, lsl 2
>         str     w4, [x0, x2, lsl 2]
>         str     w3, [x1, 400]
>         ret

So shorter is worse?  Might be because I don't understand the
difference between the 'lsl 2' and the 'sxtw 2' or the cost
of the [x1, 400] addressing.
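On the 'lsl 2' versus 'sxtw 2' question: in an AArch64 add, the 'lsl 2' operand form shifts the full 64-bit source register left by 2, while the 'sxtw 2' form first sign-extends the low 32 bits of the source register and then shifts by 2. A rough C model follows; the function names are hypothetical, and registers are modelled as plain 64-bit unsigned values.

```c
#include <stdint.h>

/* Models "add xd, xn, xm, lsl 2": all 64 bits of xm are shifted. */
uint64_t add_lsl2(uint64_t xn, uint64_t xm)
{
    return xn + (xm << 2);
}

/* Models "add xd, xn, wm, sxtw 2": only the low 32 bits of the source
   register are used, sign-extended to 64 bits, then scaled by 4. */
uint64_t add_sxtw2(uint64_t xn, uint64_t xm)
{
    uint32_t low = (uint32_t)xm;
    /* Portable sign extension of a 32-bit pattern. */
    int64_t ext = (low & 0x80000000u) ? (int64_t)low - 0x100000000LL
                                      : (int64_t)low;
    /* Unsigned multiply wraps modulo 2^64, matching register arithmetic. */
    return xn + (uint64_t)ext * 4;
}
```

The two forms agree whenever the register already holds a sign-extended 32-bit value, which fits the shorter sequence above issuing an explicit sxtw before using 'lsl 2'.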
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #4 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #3)
> (In reply to Richard Biener from comment #2)
> > So any hint on whether the code after r257077 is better or worse than
> > before?
>
> Looks worse unfortunately:
> For aarch64 at -O2 it generates:
> foo:
>         mov     w3, 44
>         mov     w2, 40
>         mov     w5, 1
>         mov     w4, 2
>         smull   x3, w1, w3
>         smull   x2, w1, w2
>         str     w5, [x0, x3]
>         add     x2, x2, 400
>         add     x1, x2, x1, sxtw 2
>         str     w4, [x0, x1]
>         ret
>
> whereas with r257077 it generates the shorter:

Sorry, I meant to write "with r257077 reverted..."

> foo:
>         mov     w3, 40
>         sxtw    x2, w1
>         mov     w4, 1
>         smaddl  x0, w1, w3, x0
>         mov     w3, 2
>         add     x1, x0, x2, lsl 2
>         str     w4, [x0, x2, lsl 2]
>         str     w3, [x1, 400]
>         ret
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #3 from ktkachov at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> So any hint on whether the code after r257077 is better or worse than before?

Looks worse unfortunately:
For aarch64 at -O2 it generates:
foo:
        mov     w3, 44
        mov     w2, 40
        mov     w5, 1
        mov     w4, 2
        smull   x3, w1, w3
        smull   x2, w1, w2
        str     w5, [x0, x3]
        add     x2, x2, 400
        add     x1, x2, x1, sxtw 2
        str     w4, [x0, x1]
        ret

whereas with r257077 it generates the shorter:
foo:
        mov     w3, 40
        sxtw    x2, w1
        mov     w4, 1
        smaddl  x0, w1, w3, x0
        mov     w3, 2
        add     x1, x0, x2, lsl 2
        str     w4, [x0, x2, lsl 2]
        str     w3, [x1, 400]
        ret
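The two listings in this comment compute the same pair of store addresses by different routes. The C model below traces each sequence register by register to make that visible. It is only a sketch: addresses are modelled as int64_t values, the struct and function names are hypothetical, and smull/smaddl are modelled as a signed 32x32->64 multiply without and with a 64-bit addend.

```c
#include <stdint.h>

typedef struct {
    int64_t first;   /* address written by the first str  */
    int64_t second;  /* address written by the second str */
} store_addrs;

/* smull xd, wn, wm: signed 32x32 -> 64 multiply. */
static int64_t smull(int32_t wn, int32_t wm)
{
    return (int64_t)wn * wm;
}

/* smaddl xd, wn, wm, xa: xa plus the signed 32x32 -> 64 product. */
static int64_t smaddl(int32_t wn, int32_t wm, int64_t xa)
{
    return xa + (int64_t)wn * wm;
}

/* Longer listing: two smull, x0 = base, w1 = idx. */
store_addrs two_smull_sequence(int64_t base, int32_t idx)
{
    store_addrs r;
    int64_t x3 = smull(idx, 44);          /* smull x3, w1, w3          */
    int64_t x2 = smull(idx, 40);          /* smull x2, w1, w2          */
    r.first = base + x3;                  /* str w5, [x0, x3]          */
    x2 = x2 + 400;                        /* add x2, x2, 400           */
    int64_t x1 = x2 + (int64_t)idx * 4;   /* add x1, x2, x1, sxtw 2    */
    r.second = base + x1;                 /* str w4, [x0, x1]          */
    return r;
}

/* Shorter listing: one smaddl, x0 = base, w1 = idx. */
store_addrs smaddl_sequence(int64_t base, int32_t idx)
{
    store_addrs r;
    int64_t x2 = (int64_t)idx;            /* sxtw x2, w1               */
    int64_t x0 = smaddl(idx, 40, base);   /* smaddl x0, w1, w3, x0     */
    int64_t x1 = x0 + x2 * 4;             /* add x1, x0, x2, lsl 2     */
    r.first = x0 + x2 * 4;                /* str w4, [x0, x2, lsl 2]   */
    r.second = x1 + 400;                  /* str w3, [x1, 400]         */
    return r;
}
```

With base 1000 and idx 7, both functions give 1308 for the first store and 1708 for the second; the difference between the listings is instruction count, not the addresses produced.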
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener ---
So any hint on whether the code after r257077 is better or worse than before?
[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
      Known to work|                            |7.2.1
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2018-01-26
                 CC|                            |ktkachov at gcc dot gnu.org
     Ever confirmed|0                           |1
   Target Milestone|---                         |8.0
      Known to fail|                            |8.0

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed.