On Tue, Aug 13, 2019 at 8:20 AM Jeff Law <l...@redhat.com> wrote:
>
> On 8/12/19 6:27 AM, Richard Biener wrote:
> > On Fri, 9 Aug 2019, Uros Bizjak wrote:
> >
> >> On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguent...@suse.de> wrote:
> >>
> >>>>>>>>>>>> (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI 
> >>>>>>>>>>>> "TARGET_AVX512F"])
> >>>>>>>>>>>>
> >>>>>>>>>>>> and then we need to split DImode for 32bits, too.
> >>>>>>>>>>>
> >>>>>>>>>>> For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> >>>>>>>>>>> condition, I'll provide _doubleword splitter later.
> >>>>>>>>>>
> >>>>>>>>>> Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use 
> >>>>>>>>>> %g0 etc.
> >>>>>>>>>> to force use of %zmmN?
> >>>>>>>>>
> >>>>>>>>> It generates V4SI mode, so - yes, AVX512VL.
> >>>>>>>>
> >>>>>>>>     case SMAX:
> >>>>>>>>     case SMIN:
> >>>>>>>>     case UMAX:
> >>>>>>>>     case UMIN:
> >>>>>>>>       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> >>>>>>>>           || (mode == SImode && !TARGET_SSE4_1))
> >>>>>>>>         return false;
> >>>>>>>>
> >>>>>>>> so there's no way to use AVX512VL for 32bit?
> >>>>>>>
> >>>>>>> There is a way, but on 32bit targets, we need to split DImode
> >>>>>>> operation to a sequence of SImode operations for unconverted pattern.
> >>>>>>> This is of course doable, but somehow more complex than simply
> >>>>>>> emitting a DImode compare + DImode cmove, which is what current
> >>>>>>> splitter does. So, a follow-up task.
> >>>>>>
> >>>>>> Please find attached the complete .md part that enables SImode for
> >>>>>> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit
> >>>>>> targets. The patterns also allows for memory operand 2, so STV has
> >>>>>> chance to create the vector pattern with implicit load. In case STV
> >>>>>> fails, the memory operand 2 is loaded to the register first;  operand
> >>>>>> 2 is used in compare and cmove instruction, so pre-loading of the
> >>>>>> operand should be beneficial.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>> Also note, that splitting should happen rarely. Due to the cost
> >>>>>> function, STV should effectively always convert minmax to a vector
> >>>>>> insn.
> >>>>>
> >>>>> I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> >>>>> this kind of "simple" conversion:
> >>>>>
> >>>>>   5.50 │1d0:   test   %esi,%es
> >>>>>   0.07 │       mov    $0x0,%ex
> >>>>>        │       cmovs  %eax,%es
> >>>>>   5.84 │       imul   %r8d,%es
> >>>>>
> >>>>> to
> >>>>>
> >>>>>   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> >>>>>   0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
> >>>>>  40.45 │       vmovd  %xmm0,%eax
> >>>>>   2.45 │       imul   %r8d,%eax
> >>>>>
> >>>>> which looks like a RA artifact in the end.  We spill %esi only
> >>>>> with -mstv here as STV introduces a (subreg:V4SI ...) use
> >>>>> of a pseudo ultimatively set from di.  STV creates an additional
> >>>>> pseudo for this (copy-in) but it places that copy next to the
> >>>>> original def rather than next to the start of the chain it
> >>>>> converts which is probably the issue why we spill.  And this
> >>>>> is because it inserts those at each definition of the pseudo
> >>>>> rather than just at the reaching definition(s) or at the
> >>>>> uses of the pseudo in the chain (that because there may be
> >>>>> defs of that pseudo in the chain itself).  Note that STV emits
> >>>>> such "conversion" copies as simple reg-reg moves:
> >>>>>
> >>>>> (insn 1094 3 4 2 (set (reg:SI 777)
> >>>>>         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> >>>>>      (nil))
> >>>>>
> >>>>> but those do not prevail very long (this one gets removed by CSE2).
> >>>>> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> >>>>> and computes
> >>>>>
> >>>>>     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> >>>>>     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> >>>>>
> >>>>> so I wonder if STV shouldn't instead emit gpr->xmm moves
> >>>>> here (but I guess nothing again prevents RTL optimizers from
> >>>>> combining that with the single-use in the max instruction...).
> >>>>>
> >>>>> So this boils down to STV splitting live-ranges but other
> >>>>> passes undoing that and then RA not considering splitting
> >>>>> live-ranges here, arriving at unoptimal allocation.
> >>>>>
> >>>>> A testcase showing this issue is (simplified from 464.h264ref
> >>>>> UMVLine16Y_11):
> >>>>>
> >>>>> unsigned short
> >>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> >>>>> {
> >>>>>   if (y != width)
> >>>>>     {
> >>>>>       y = y < 0 ? 0 : y;
> >>>>>       return Pic[y * width];
> >>>>>     }
> >>>>>   return Pic[y];
> >>>>> }
> >>>>>
> >>>>> where the condition and the Pic[y] load mimics the other use of y.
> >>>>> Different, even worse spilling is generated by
> >>>>>
> >>>>> unsigned short
> >>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> >>>>> {
> >>>>>   y = y < 0 ? 0 : y;
> >>>>>   return Pic[y * width] + y;
> >>>>> }
> >>>>>
> >>>>> I guess this all shows that STVs "trick" of simply wrapping
> >>>>> integer mode pseudos in (subreg:vector-mode ...) is bad?
> >>>>>
> >>>>> I've added a (failing) testcase to reflect the above.
> >>>>
> >>>> Experimenting a bit with just for the conversion insns using
> >>>> V4SImode pseudos we end up preserving those moves (but I
> >>>> do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg
> >>>> ends up using movv4si_internal which only leaves us with
> >>>> memory for the SImode operand) _plus_ moving the move next
> >>>> to the actual use has an effect.  Not necssarily a good one
> >>>> though:
> >>>>
> >>>>         vpxor   %xmm0, %xmm0, %xmm0
> >>>>         vmovaps %xmm0, -16(%rsp)
> >>>>         movl    %esi, -16(%rsp)
> >>>>         vpmaxsd -16(%rsp), %xmm0, %xmm0
> >>>>         vmovd   %xmm0, %eax
> >>>>
> >>>> eh?  I guess the lowpart set is not good (my patch has this
> >>>> as well, but I got saved by never having vector modes to subset...).
> >>>> Using
> >>>>
> >>>>     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
> >>>>             (const_vector:V4SI [
> >>>>                     (const_int 0 [0]) repeated x4
> >>>>                 ])
> >>>>             (const_int 1 [0x1]))) "t3.c":5:10 -1
> >>>>
> >>>> for the move ends up with
> >>>>
> >>>>         vpxor   %xmm1, %xmm1, %xmm1
> >>>>         vpinsrd $0, %esi, %xmm1, %xmm0
> >>>>
> >>>> eh?  LRA chooses the correct alternative here but somehow
> >>>> postreload CSE CSEs the zero with the xmm1 clearing, leading
> >>>> to the vpinsrd...  (I guess a general issue, not sure if really
> >>>> worse - definitely a larger instruction).  Unfortunately
> >>>> postreload-cse doesn't add a reg-equal note.  This happens only
> >>>> when emitting the reg move before the use, not doing that emits
> >>>> a vmovd as expected.
> >>>>
> >>>> At least the spilling is gone here.
> >>>>
> >>>> I am re-testing as follows, the main change is that
> >>>> general_scalar_chain::make_vector_copies now generates a
> >>>> vector pseudo as destination (and I've fixed up the code
> >>>> to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> >>>>
> >>>> Hope this fixes the observed slowdowns (it fixes the new testcase).
> >>>
> >>> It fixes the slowdown observed in 416.gamess and 464.h264ref.
> >>>
> >>> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> >>>
> >>> CCing Jeff who "knows RTL".
> >>>
> >>> OK?
> >>
> >> Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid
> >> spurious test failures on SSE4.1 targets.
> >
> > Done.  I've also adjusted the i386.md changelog as follows:
> >
> >         * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
> >         (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
> >         (*<maxmin>di3_doubleword): Likewise.
> >
> > I see
> >
> > FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest
> > FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest
> > FAIL: gcc.target/i386/pr78794.c scan-assembler pandn
> >
> > with the latest patch (this is with -m32) where -mstv causes
> > all spills to go away and the cmoves replaced (so clearly
> > better code after the patch) for pr65105-5.c, no obvious
> > improvements for pr65105-3.c where cmov does appear with -mstv.
> > I'd rather not "fix" those by adding -mno-stv but instead have
> > the Intel people fix costing for slm and/or decide what to do.
> > For pr65105-3.c I'm not sure why if-conversion didn't choose
> > to use cmov, so clearly the enabled minmax patterns expose the
> > "failure" here.
> I'm not sure how much effort Intel is putting into Silvermont tuning
> these days.  So I'd suggest giving HJ a heads-up and a reasonable period
> of time to take a looksie, but I wouldn't hold the patch for long due to
> a Silvermont tuning issue.

Leave pr65105-3.c to fail for now.  We can take a look later.

Thanks.

-- 
H.J.

Reply via email to