On Tue, Aug 13, 2019 at 8:20 AM Jeff Law <l...@redhat.com> wrote: > > On 8/12/19 6:27 AM, Richard Biener wrote: > > On Fri, 9 Aug 2019, Uros Bizjak wrote: > > > >> On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguent...@suse.de> wrote: > >> > >>>>>>>>>>>> (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI > >>>>>>>>>>>> "TARGET_AVX512F"]) > >>>>>>>>>>>> > >>>>>>>>>>>> and then we need to split DImode for 32bits, too. > >>>>>>>>>>> > >>>>>>>>>>> For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > >>>>>>>>>>> condition, I'll provide _doubleword splitter later. > >>>>>>>>>> > >>>>>>>>>> Shouldn't that be TARGET_AVX512VL instead? Or does the insn use > >>>>>>>>>> %g0 etc. > >>>>>>>>>> to force use of %zmmN? > >>>>>>>>> > >>>>>>>>> It generates V4SI mode, so - yes, AVX512VL. > >>>>>>>> > >>>>>>>> case SMAX: > >>>>>>>> case SMIN: > >>>>>>>> case UMAX: > >>>>>>>> case UMIN: > >>>>>>>> if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > >>>>>>>> || (mode == SImode && !TARGET_SSE4_1)) > >>>>>>>> return false; > >>>>>>>> > >>>>>>>> so there's no way to use AVX512VL for 32bit? > >>>>>>> > >>>>>>> There is a way, but on 32bit targets, we need to split DImode > >>>>>>> operation to a sequence of SImode operations for unconverted pattern. > >>>>>>> This is of course doable, but somehow more complex than simply > >>>>>>> emitting a DImode compare + DImode cmove, which is what current > >>>>>>> splitter does. So, a follow-up task. > >>>>>> > >>>>>> Please find attached the complete .md part that enables SImode for > >>>>>> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > >>>>>> targets. The patterns also allows for memory operand 2, so STV has > >>>>>> chance to create the vector pattern with implicit load. In case STV > >>>>>> fails, the memory operand 2 is loaded to the register first; operand > >>>>>> 2 is used in compare and cmove instruction, so pre-loading of the > >>>>>> operand should be beneficial. > >>>>> > >>>>> Thanks. > >>>>> > >>>>>> Also note, that splitting should happen rarely. Due to the cost > >>>>>> function, STV should effectively always convert minmax to a vector > >>>>>> insn. > >>>>> > >>>>> I've analyzed the 464.h264ref slowdown on Haswell and it is due to > >>>>> this kind of "simple" conversion: > >>>>> > >>>>> 5.50 │1d0: test %esi,%es > >>>>> 0.07 │ mov $0x0,%ex > >>>>> │ cmovs %eax,%es > >>>>> 5.84 │ imul %r8d,%es > >>>>> > >>>>> to > >>>>> > >>>>> 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 > >>>>> 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 > >>>>> 40.45 │ vmovd %xmm0,%eax > >>>>> 2.45 │ imul %r8d,%eax > >>>>> > >>>>> which looks like a RA artifact in the end. We spill %esi only > >>>>> with -mstv here as STV introduces a (subreg:V4SI ...) use > >>>>> of a pseudo ultimatively set from di. STV creates an additional > >>>>> pseudo for this (copy-in) but it places that copy next to the > >>>>> original def rather than next to the start of the chain it > >>>>> converts which is probably the issue why we spill. And this > >>>>> is because it inserts those at each definition of the pseudo > >>>>> rather than just at the reaching definition(s) or at the > >>>>> uses of the pseudo in the chain (that because there may be > >>>>> defs of that pseudo in the chain itself). Note that STV emits > >>>>> such "conversion" copies as simple reg-reg moves: > >>>>> > >>>>> (insn 1094 3 4 2 (set (reg:SI 777) > >>>>> (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 > >>>>> (nil)) > >>>>> > >>>>> but those do not prevail very long (this one gets removed by CSE2). > >>>>> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use > >>>>> and computes > >>>>> > >>>>> r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS > >>>>> a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 > >>>>> > >>>>> so I wonder if STV shouldn't instead emit gpr->xmm moves > >>>>> here (but I guess nothing again prevents RTL optimizers from > >>>>> combining that with the single-use in the max instruction...). > >>>>> > >>>>> So this boils down to STV splitting live-ranges but other > >>>>> passes undoing that and then RA not considering splitting > >>>>> live-ranges here, arriving at unoptimal allocation. > >>>>> > >>>>> A testcase showing this issue is (simplified from 464.h264ref > >>>>> UMVLine16Y_11): > >>>>> > >>>>> unsigned short > >>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > >>>>> { > >>>>> if (y != width) > >>>>> { > >>>>> y = y < 0 ? 0 : y; > >>>>> return Pic[y * width]; > >>>>> } > >>>>> return Pic[y]; > >>>>> } > >>>>> > >>>>> where the condition and the Pic[y] load mimics the other use of y. > >>>>> Different, even worse spilling is generated by > >>>>> > >>>>> unsigned short > >>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > >>>>> { > >>>>> y = y < 0 ? 0 : y; > >>>>> return Pic[y * width] + y; > >>>>> } > >>>>> > >>>>> I guess this all shows that STVs "trick" of simply wrapping > >>>>> integer mode pseudos in (subreg:vector-mode ...) is bad? > >>>>> > >>>>> I've added a (failing) testcase to reflect the above. > >>>> > >>>> Experimenting a bit with just for the conversion insns using > >>>> V4SImode pseudos we end up preserving those moves (but I > >>>> do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg > >>>> ends up using movv4si_internal which only leaves us with > >>>> memory for the SImode operand) _plus_ moving the move next > >>>> to the actual use has an effect. Not necssarily a good one > >>>> though: > >>>> > >>>> vpxor %xmm0, %xmm0, %xmm0 > >>>> vmovaps %xmm0, -16(%rsp) > >>>> movl %esi, -16(%rsp) > >>>> vpmaxsd -16(%rsp), %xmm0, %xmm0 > >>>> vmovd %xmm0, %eax > >>>> > >>>> eh? I guess the lowpart set is not good (my patch has this > >>>> as well, but I got saved by never having vector modes to subset...). > >>>> Using > >>>> > >>>> (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) > >>>> (const_vector:V4SI [ > >>>> (const_int 0 [0]) repeated x4 > >>>> ]) > >>>> (const_int 1 [0x1]))) "t3.c":5:10 -1 > >>>> > >>>> for the move ends up with > >>>> > >>>> vpxor %xmm1, %xmm1, %xmm1 > >>>> vpinsrd $0, %esi, %xmm1, %xmm0 > >>>> > >>>> eh? LRA chooses the correct alternative here but somehow > >>>> postreload CSE CSEs the zero with the xmm1 clearing, leading > >>>> to the vpinsrd... (I guess a general issue, not sure if really > >>>> worse - definitely a larger instruction). Unfortunately > >>>> postreload-cse doesn't add a reg-equal note. This happens only > >>>> when emitting the reg move before the use, not doing that emits > >>>> a vmovd as expected. > >>>> > >>>> At least the spilling is gone here. > >>>> > >>>> I am re-testing as follows, the main change is that > >>>> general_scalar_chain::make_vector_copies now generates a > >>>> vector pseudo as destination (and I've fixed up the code > >>>> to not generate (subreg:V4SI (reg:V4SI 1234) 0)). > >>>> > >>>> Hope this fixes the observed slowdowns (it fixes the new testcase). > >>> > >>> It fixes the slowdown observed in 416.gamess and 464.h264ref. > >>> > >>> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. > >>> > >>> CCing Jeff who "knows RTL". > >>> > >>> OK? > >> > >> Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid > >> spurious test failures on SSE4.1 targets. > > > > Done. I've also adjusted the i386.md changelog as follows: > > > > * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander. > > (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split. > > (*<maxmin>di3_doubleword): Likewise. > > > > I see > > > > FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest > > FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest > > FAIL: gcc.target/i386/pr78794.c scan-assembler pandn > > > > with the latest patch (this is with -m32) where -mstv causes > > all spills to go away and the cmoves replaced (so clearly > > better code after the patch) for pr65105-5.c, no obvious > > improvements for pr65105-3.c where cmov does appear with -mstv. > > I'd rather not "fix" those by adding -mno-stv but instead have > > the Intel people fix costing for slm and/or decide what to do. > > For pr65105-3.c I'm not sure why if-conversion didn't choose > > to use cmov, so clearly the enabled minmax patterns expose the > > "failure" here. > I'm not sure how much effort Intel is putting into Silvermont tuning > these days. So I'd suggest giving HJ a heads-up and a reasonable period > of time to take a looksie, but I wouldn't hold the patch for long due to > a Silvermont tuning issue.
Leave pr65105-3.c to fail for now. We can take a look later. Thanks. -- H.J.