[Bug rtl-optimization/91154] [10 Regression] 456.hmmer regression on Haswell caused by r272922

ubizjak at gmail dot com Mon, 19 Aug 2019 04:45:57 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154


--- Comment #27 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to rguent...@suse.de from comment #25)
> On Sat, 17 Aug 2019, ubizjak at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
> > 
> > --- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
> > It looks that the patch introduced a (small?) runtime regression of 5% in
> > SPEC2000 300.twolf on haswell [1]. Maybe worth looking at.
> 
> Biggest changes when benchmarking -mno-stv (base) against -mstv (peak):
> 
>    7.28%         37789  twolf_peak.none  twolf_peak.none   [.] ucxx2 
>    4.21%         25709  twolf_base.none  twolf_base.none   [.] ucxx2        
>    3.72%         22584  twolf_base.none  twolf_base.none   [.] new_dbox
>    2.48%         22364  twolf_peak.none  twolf_peak.none   [.] new_dbox
>    1.49%          8270  twolf_base.none  twolf_base.none   [.] sub_penal
>    1.12%          7576  twolf_peak.none  twolf_peak.none   [.] sub_penal
>    1.36%          9314  twolf_peak.none  twolf_peak.none   [.] old_assgnto_new2
>    1.11%          5257  twolf_base.none  twolf_base.none   [.] old_assgnto_new2
> 
> and in ucxx2 I see
> 
>   0.17 │       mov    %eax,(%rsp)
>   3.55 │       vpmins (%rsp),%xmm0,%xmm1   
>        │       test   %eax,%eax
>   0.22 │       vmovd  %xmm1,%r8d              
>   0.80 │       cmovs  %esi,%r8d
> 
> This is from code like
> 
>   a1LoBin = Trybin/binWidth < 0 ? 0 : (Trybin>numBins ? numBins : Trybin)
> 
> with only the inner one recognized as MIN because 'numBins' is only
> ever loaded conditionally and we don't speculate it.  So we expand
> from
> 
>   _41 = _40 / binWidth.15_36;
>   if (_41 >= 0)
>     goto <bb 5>; [59.00%]
>   else
>     goto <bb 6>; [41.00%]
> 
> bb5:
>   numBins.26_42 = numBins;
>   iftmp.19_315 = MIN_EXPR <_41, numBins.26_42>;
> 
> bb6:
>   # iftmp.19_267 = PHI <iftmp.19_315(5), 0(4)>
> 
> ending up with
> 
>         movl    %r9d, %eax
>         cltd
>         idivl   %ecx
>         movl    %eax, (%rsp)
>         vpminsd (%rsp), %xmm0, %xmm1
>         testl   %eax, %eax
>         vmovd   %xmm1, %r11d
>         cmovs   %esi, %r11d
> 
> and STV converting single-instruction 'chains':
> 
> Collected chain #40... 
>   insns: 381
>   defs to convert: r463, r465
> Computing gain for chain #40...
>   Instruction gain 8 for   381: {r465:SI=smin(r463:SI,[`numBins']);clobber flags:CC;}
>       REG_DEAD r463:SI
>       REG_UNUSED flags:CC
>   Instruction conversion gain: 8 
>   Registers conversion cost: 4
>   Total gain: 4
> Converting chain #40...

Is this in the STV1 pass? This (pre-combine) pass should be enabled only for
TImode conversion, a semi-hack where 64-bit targets convert memory accesses to
TImode. General STV should not be run before combine.

> to me the "spill" to (%rsp) looks suspicious and even more so
> the vector(!) memory use in vpminsd.  RA could have used
> 
>   movd  %eax, %xmm1
>   vpminsd %xmm1, %xmm0, %xmm1
> 
> no?  IRA allocates the pseudo to memory.  Testcase:

This is how IRA handles subregs. Please note that the memory is correctly
aligned, so the vector load does not trip an alignment trap. However, on x86
this approach triggers a store-forwarding stall, because the 32-bit store
cannot be forwarded to the wider vector load that follows it.
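
To make the two access patterns concrete, here is a minimal C sketch using
SSE2 intrinsics. It is only an illustration, not what GCC emits: the function
names are hypothetical, the signed minimum is emulated with compare/and/or
because pminsd itself needs SSE4.1, and an optimizing compiler is free to
turn the first variant into the second. On targets without SSE2 the sketch
falls back to plain scalar code.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Signed 32-bit minimum per lane, SSE2 only (stand-in for pminsd). */
static inline __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);            /* all-ones where a > b */
    return _mm_or_si128(_mm_and_si128(a_gt_b, b),       /* take b where a > b   */
                        _mm_andnot_si128(a_gt_b, a));   /* otherwise take a     */
}

/* Variant 1: 4-byte store into an aligned 16-byte slot, then a 16-byte
   load from the same slot. The load is wider than the store, which is
   the pattern that defeats store forwarding. */
static int min_via_stack_slot(int x, int limit)
{
    _Alignas(16) int32_t slot[4] = {0, 0, 0, 0};
    slot[0] = x;                                          /* narrow store */
    __m128i vx = _mm_load_si128((const __m128i *)slot);   /* wide load    */
    return _mm_cvtsi128_si32(min_epi32_sse2(vx, _mm_cvtsi32_si128(limit)));
}

/* Variant 2: move the scalar straight into a vector register (movd),
   with no round trip through memory. */
static int min_via_movd(int x, int limit)
{
    __m128i vx = _mm_cvtsi32_si128(x);
    return _mm_cvtsi128_si32(min_epi32_sse2(vx, _mm_cvtsi32_si128(limit)));
}
#else
/* Scalar fallback so the sketch still builds on non-SSE2 targets. */
static int min_via_stack_slot(int x, int limit) { return x < limit ? x : limit; }
static int min_via_movd(int x, int limit)       { return x < limit ? x : limit; }
#endif

/* Both variants must agree with the plain scalar minimum. */
static void check_variants_agree(int x, int limit)
{
    int expected = x < limit ? x : limit;
    assert(min_via_stack_slot(x, limit) == expected);
    assert(min_via_movd(x, limit) == expected);
}
```

Both variants return the same value; they differ only in how the scalar
reaches the vector register, which is where the stall comes from.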
