https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
--- Comment #27 from Uroš Bizjak <ubizjak at gmail dot com> --- (In reply to rguent...@suse.de from comment #25) > On Sat, 17 Aug 2019, ubizjak at gmail dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154 > > > > --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> --- > > It looks that the patch introduced a (small?) runtime regression of 5% in > > SPEC2000 300.twolf on haswell [1]. Maybe worth looking at. > > Biggest changes when benchmarking -mno-stv (base) against -mstv (peak): > > 7.28% 37789 twolf_peak.none twolf_peak.none [.] ucxx2 > 4.21% 25709 twolf_base.none twolf_base.none [.] ucxx2 > 3.72% 22584 twolf_base.none twolf_base.none [.] new_dbox > 2.48% 22364 twolf_peak.none twolf_peak.none [.] new_dbox > 1.49% 8270 twolf_base.none twolf_base.none [.] sub_penal > 1.12% 7576 twolf_peak.none twolf_peak.none [.] sub_penal > 1.36% 9314 twolf_peak.none twolf_peak.none [.] > old_assgnto_new2 > 1.11% 5257 twolf_base.none twolf_base.none [.] > old_assgnto_new2 > > and in ucxx2 I see > > 0.17 │ mov %eax,(%rsp) > 3.55 │ vpmins (%rsp),%xmm0,%xmm1 > │ test %eax,%eax > 0.22 │ vmovd %xmm1,%r8d > 0.80 │ cmovs %esi,%r8d > > This is from code like > > a1LoBin = Trybin/binWidth < 0 ? 0 : (Trybin>numBins ? numBins : Trybin) > > with only the inner one recognized as MIN because 'numBins' is only > ever loaded conditionally and we don't speculate it. So we expand > from > > _41 = _40 / binWidth.15_36; > if (_41 >= 0) > goto <bb 5>; [59.00%] > else > goto <bb 6>; [41.00%] > > bb5: > numBins.26_42 = numBins; > iftmp.19_315 = MIN_EXPR <_41, numBins.26_42>; > > bb6: > # iftmp.19_267 = PHI <iftmp.19_315(5), 0(4)> > > ending up with > > movl %r9d, %eax > cltd > idivl %ecx > movl %eax, (%rsp) > vpminsd (%rsp), %xmm0, %xmm1 > testl %eax, %eax > vmovd %xmm1, %r11d > cmovs %esi, %r11d > > and STV converting single-instruction 'chains': > > Collected chain #40... > insns: 381 > defs to convert: r463, r465 > Computing gain for chain #40... > Instruction gain 8 for 381: {r465:SI=smin(r463:SI,[`numBins']);clobber > flags:CC;} > REG_DEAD r463:SI > REG_UNUSED flags:CC > Instruction conversion gain: 8 > Registers conversion cost: 4 > Total gain: 4 > Converting chain #40... Is this in STV1 pass? This (pre-combine) pass should be enabled only for TImode conversion, a semi-hack where 64bit targets convert memory access to TImode. General STV should not be ran before combine. > to me the "spill" to (%rsp) looks suspicious and even more so > the vector(!) memory use in vpminsd. RA could have used > > movd %eax, %xmm1 > vpminsd %xmm1, %xmm0, %xmm1 > > no? IRA allocates the pseudo to memory. Testcase: This is how IRA handles subregs. Please note, that the memory is correctly aligned, so vector load does not trip alignment trap. However, on x86 this approach triggers store forwarding stall.