On Fri, 9 Aug 2019, Richard Biener wrote:

> On Fri, 9 Aug 2019, Richard Biener wrote:
> 
> > On Fri, 9 Aug 2019, Uros Bizjak wrote:
> > 
> > > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubiz...@gmail.com> wrote:
> > > 
> > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > >
> > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > >
> > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > >
> > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use
> > > > > > > %g0 etc. to force use of %zmmN?
> > > > > >
> > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > >
> > > > >     case SMAX:
> > > > >     case SMIN:
> > > > >     case UMAX:
> > > > >     case UMIN:
> > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > >         return false;
> > > > >
> > > > > so there's no way to use AVX512VL for 32bit?
> > > >
> > > > There is a way, but on 32bit targets, we need to split DImode
> > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > This is of course doable, but somehow more complex than simply
> > > > emitting a DImode compare + DImode cmove, which is what current
> > > > splitter does. So, a follow-up task.
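To make the "sequence of SImode operations" concrete, here is a minimal
sketch in plain C of a signed 64-bit maximum built only from 32-bit
halves.  It is not the splitter from the .md file; the function names
(smax64_from_halves, check_smax64) are made up for illustration, and the
approach (biased compare of the high words, unsigned compare of the low
words, one select per half) is just one possible decomposition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not GCC code: signed 64-bit maximum computed
   from 32-bit halves only, as a 32-bit target would have to do it.  */
static int64_t
smax64_from_halves (int64_t a, int64_t b)
{
  uint32_t a_lo = (uint32_t) ((uint64_t) a & 0xffffffffu);
  uint32_t a_hi = (uint32_t) ((uint64_t) a >> 32);
  uint32_t b_lo = (uint32_t) ((uint64_t) b & 0xffffffffu);
  uint32_t b_hi = (uint32_t) ((uint64_t) b >> 32);

  /* Flipping the sign bit turns the signed compare of the high words
     into an unsigned compare.  */
  uint32_t a_key = a_hi ^ 0x80000000u;
  uint32_t b_key = b_hi ^ 0x80000000u;

  /* The high words decide; on a tie the low words compare unsigned.  */
  int a_wins = a_key > b_key || (a_key == b_key && a_lo > b_lo);

  /* Select each half separately, like two 32-bit conditional moves.  */
  uint32_t r_lo = a_wins ? a_lo : b_lo;
  uint32_t r_hi = a_wins ? a_hi : b_hi;

  /* Reassemble; the unsigned-to-signed conversion wraps with GCC.  */
  return (int64_t) (((uint64_t) r_hi << 32) | r_lo);
}

/* Small self-check covering sign and tie-on-high-word cases.  */
static void
check_smax64 (void)
{
  assert (smax64_from_halves (-1, 0) == 0);
  assert (smax64_from_halves (INT64_MIN, INT64_MAX) == INT64_MAX);
  assert (smax64_from_halves (0x100000000LL, 0xffffffffLL) == 0x100000000LL);
  assert (smax64_from_halves (-2, -1) == -1);
}
```

Calling check_smax64 () from a test driver exercises the cases where the
high words differ in sign and where they tie.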
> > > 
> > > Please find attached the complete .md part that enables SImode for
> > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> > > targets. The patterns also allow for memory operand 2, so STV has
> > > a chance to create the vector pattern with implicit load. In case STV
> > > fails, the memory operand 2 is loaded to the register first;  operand
> > > 2 is used in compare and cmove instruction, so pre-loading of the
> > > operand should be beneficial.
> > 
> > Thanks.
> > 
> > > Also note that splitting should happen rarely. Due to the cost
> > > function, STV should effectively always convert minmax to a vector
> > > insn.
> > 
> > I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> > this kind of "simple" conversion:
> > 
> >   5.50 │1d0:   test   %esi,%es
> >   0.07 │       mov    $0x0,%ex
> >        │       cmovs  %eax,%es
> >   5.84 │       imul   %r8d,%es
> > 
> > to
> > 
> >   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> >   0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
> >  40.45 │       vmovd  %xmm0,%eax
> >   2.45 │       imul   %r8d,%eax
> > 
> > which looks like a RA artifact in the end.  We spill %esi only
> > with -mstv here as STV introduces a (subreg:V4SI ...) use
> > of a pseudo ultimately set from di.  STV creates an additional
> > pseudo for this (copy-in) but it places that copy next to the
> > original def rather than next to the start of the chain it
> > converts which is probably the issue why we spill.  And this
> > is because it inserts those at each definition of the pseudo
> > rather than just at the reaching definition(s) or at the
> > uses of the pseudo in the chain (that is because there may be
> > defs of that pseudo in the chain itself).  Note that STV emits
> > such "conversion" copies as simple reg-reg moves:
> > 
> > (insn 1094 3 4 2 (set (reg:SI 777)
> >         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> >      (nil))
> > 
> > but those do not prevail very long (this one gets removed by CSE2).
> > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> > and computes
> > 
> >     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> >     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> > 
> > so I wonder if STV shouldn't instead emit gpr->xmm moves
> > here (but I guess nothing again prevents RTL optimizers from
> > combining that with the single-use in the max instruction...).
> > 
> > So this boils down to STV splitting live-ranges but other
> > passes undoing that and then RA not considering splitting
> > live-ranges here, arriving at suboptimal allocation.
> > 
> > A testcase showing this issue is (simplified from 464.h264ref
> > UMVLine16Y_11):
> > 
> > unsigned short
> > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > {
> >   if (y != width)
> >     {
> >       y = y < 0 ? 0 : y;
> >       return Pic[y * width];
> >     }
> >   return Pic[y];
> > }
> > 
> > where the condition and the Pic[y] load mimics the other use of y.
> > Different, even worse spilling is generated by
> > 
> > unsigned short
> > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > {
> >   y = y < 0 ? 0 : y;
> >   return Pic[y * width] + y;
> > }
> > 
> > I guess this all shows that STV's "trick" of simply wrapping
> > integer mode pseudos in (subreg:vector-mode ...) is bad?
> > 
> > I've added a (failing) testcase to reflect the above.
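For anyone who wants to check that the two reduced functions still behave
the same before and after the STV change, a small harness along these
lines may help.  The function bodies are copied from the mail; the names
umv_two_paths, umv_reuse and check_umv, and the contents of the pic
array, are made up for this sketch and are not part of the committed
testcase.

```c
#include <assert.h>

/* First variant from the mail: y has a second use outside the clamp.  */
static unsigned short
umv_two_paths (unsigned short *Pic, int y, int width)
{
  if (y != width)
    {
      y = y < 0 ? 0 : y;	/* the max (y, 0) clamp discussed above */
      return Pic[y * width];
    }
  return Pic[y];
}

/* Second variant: the clamped y feeds both the index and the sum.  */
static unsigned short
umv_reuse (unsigned short *Pic, int y, int width)
{
  y = y < 0 ? 0 : y;
  return Pic[y * width] + y;
}

/* Self-check on a made-up picture buffer where pic[i] == 10 * i.  */
static void
check_umv (void)
{
  unsigned short pic[16];
  for (int i = 0; i < 16; i++)
    pic[i] = (unsigned short) (i * 10);

  assert (umv_two_paths (pic, -3, 2) == 0);	/* clamped to pic[0] */
  assert (umv_two_paths (pic, 2, 3) == 60);	/* pic[2 * 3] */
  assert (umv_two_paths (pic, 2, 2) == 20);	/* y == width: pic[2] */
  assert (umv_reuse (pic, -5, 3) == 0);		/* pic[0] + 0 */
  assert (umv_reuse (pic, 2, 3) == 62);		/* pic[6] + 2 */
}
```

Building this once with -mstv and once with -mno-stv and running
check_umv () in both is a cheap way to confirm that only the code
generation differs.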
> 
> Experimenting a bit with using V4SImode pseudos just for the
> conversion insns, we end up preserving those moves (but I
> do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
> ends up using movv4si_internal, which only leaves us with
> memory for the SImode operand) _plus_ moving the move next
> to the actual use has an effect.  Not necessarily a good one
> though:
> 
>         vpxor   %xmm0, %xmm0, %xmm0
>         vmovaps %xmm0, -16(%rsp)
>         movl    %esi, -16(%rsp)
>         vpmaxsd -16(%rsp), %xmm0, %xmm0
>         vmovd   %xmm0, %eax
> 
> eh?  I guess the lowpart set is not good (my patch has this
> as well, but I got saved by never having vector modes to subset...).
> Using
> 
>     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
>             (const_vector:V4SI [
>                     (const_int 0 [0]) repeated x4
>                 ])
>             (const_int 1 [0x1]))) "t3.c":5:10 -1
> 
> for the move ends up with
> 
>         vpxor   %xmm1, %xmm1, %xmm1
>         vpinsrd $0, %esi, %xmm1, %xmm0
> 
> eh?  LRA chooses the correct alternative here but somehow
> postreload CSE CSEs the zero with the xmm1 clearing, leading
> to the vpinsrd...  (I guess a general issue, not sure if really
> worse - definitely a larger instruction).  Unfortunately
> postreload-cse doesn't add a reg-equal note.  This happens only
> when emitting the reg move before the use, not doing that emits
> a vmovd as expected.
> 
> At least the spilling is gone here.
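For readers less used to RTL, the vec_merge/vec_duplicate expression
quoted above can be modeled lane by lane in plain C.  This is only a toy
model under the documented RTL semantics (a set bit in the vec_merge mask
takes the lane from the first vector, a clear bit from the second); the
toy_* names and the fixed four-lane layout are assumptions made for the
sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NLANES 4	/* V4SI: four 32-bit lanes */

/* Toy model of (vec_merge a b mask): lane i comes from a when bit i
   of mask is set, otherwise from b.  */
static void
toy_vec_merge (const int32_t *a, const int32_t *b, unsigned mask,
	       int32_t *out)
{
  for (int i = 0; i < NLANES; i++)
    out[i] = ((mask >> i) & 1) ? a[i] : b[i];
}

/* Toy model of (vec_duplicate x): every lane holds x.  */
static void
toy_vec_duplicate (int32_t x, int32_t *out)
{
  for (int i = 0; i < NLANES; i++)
    out[i] = x;
}

/* The copy-in from the mail: lane 0 = x, lanes 1..3 = 0.  */
static void
toy_copy_in (int32_t x, int32_t *out)
{
  int32_t dup[NLANES];
  const int32_t zero[NLANES] = { 0, 0, 0, 0 };

  toy_vec_duplicate (x, dup);
  toy_vec_merge (dup, zero, 1u, out);
}

/* Self-check: only lane 0 carries the scalar.  */
static void
check_copy_in (void)
{
  int32_t v[NLANES];

  toy_copy_in (42, v);
  assert (v[0] == 42 && v[1] == 0 && v[2] == 0 && v[3] == 0);
}
```

With mask 1 the result is the scalar in lane 0 and zeros elsewhere, which
is the effect of a plain vmovd from a GPR.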
> 
> I am re-testing as follows, the main change is that
> general_scalar_chain::make_vector_copies now generates a
> vector pseudo as destination (and I've fixed up the code
> to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> 
> Hope this fixes the observed slowdowns (it fixes the new testcase).

It fixes the slowdown observed in 416.gamess and 464.h264ref.

Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.

CCing Jeff who "knows RTL".

OK?

Thanks,
Richard.

> Richard.
> 
> mccas.F:twotff_ for 416.gamess
> refbuf.c:UMVLine16Y_11 for 464.h264ref
> 
> 2019-08-07  Richard Biener  <rguent...@suse.de>
> 
>       PR target/91154
>       * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
>       mode arguments.
>       (scalar_chain::smode): New member.
>       (scalar_chain::vmode): Likewise.
>       (dimode_scalar_chain): Rename to...
>       (general_scalar_chain): ... this.
>       (general_scalar_chain::general_scalar_chain): Take mode arguments.
>       (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
>       base with TImode and V1TImode.
>       * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
>       (general_scalar_chain::vector_const_cost): Adjust for SImode
>       chains.
>       (general_scalar_chain::compute_convert_gain): Likewise.  Fix
>       reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
>       scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
>       gain if not zero.
>       (general_scalar_chain::replace_with_subreg): Use vmode/smode.
>       Elide the subreg if the reg is already vector.
>       (general_scalar_chain::make_vector_copies): Likewise.  Handle
>       non-DImode chains appropriately.  Use a vector-mode pseudo as
>       destination.
>       (general_scalar_chain::convert_reg): Likewise.
>       (general_scalar_chain::convert_op): Likewise.  Elide the
>       subreg if the reg is already vector.
>       (general_scalar_chain::convert_insn): Likewise.  Add
>       fatal_insn_not_found if the result is not recognized.
>       (convertible_comparison_p): Pass in the scalar mode and use that.
>       (general_scalar_to_vector_candidate_p): Likewise.  Rename from
>       dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
>       (scalar_to_vector_candidate_p): Remove by inlining into single
>       caller.
>       (general_remove_non_convertible_regs): Rename from
>       dimode_remove_non_convertible_regs.
>       (remove_non_convertible_regs): Remove by inlining into single caller.
>       (convert_scalars_to_vector): Handle SImode and DImode chains
>       in addition to TImode chains.
>       * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
> 
>       * gcc.target/i386/pr91154.c: New testcase.
>       * gcc.target/i386/minmax-3.c: Likewise.
>       * gcc.target/i386/minmax-4.c: Likewise.
>       * gcc.target/i386/minmax-5.c: Likewise.
>       * gcc.target/i386/minmax-6.c: Likewise.
> 
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c   (revision 274111)
> +++ gcc/config/i386/i386-features.c   (working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>  
>  /* Initialize new chain.  */
>  
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
>  {
> +  smode = smode_;
> +  vmode = vmode_;
> +
>    chain_id = ++max_id;
>  
>     if (dump_file)
> @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
>     conversion.  */
>  
>  void
> -dimode_scalar_chain::mark_dual_mode_def (df_ref def)
> +general_scalar_chain::mark_dual_mode_def (df_ref def)
>  {
>    gcc_assert (DF_REF_REG_DEF_P (def));
>  
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
>        && !HARD_REGISTER_P (SET_DEST (def_set)))
>      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>  
> +  /* ???  The following is quadratic since analyze_register_chain
> +     iterates over all refs to look for dual-mode regs.  Instead this
> +     should be done separately for all regs mentioned in the chain once.  */
>    df_ref ref;
>    df_ref def;
>    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
>     instead of using a scalar one.  */
>  
>  int
> -dimode_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp)
>  {
>    gcc_assert (CONST_INT_P (exp));
>  
> -  if (standard_sse_constant_p (exp, V2DImode))
> -    return COSTS_N_INSNS (1);
> -  return ix86_cost->sse_load[1];
> +  if (standard_sse_constant_p (exp, vmode))
> +    return ix86_cost->sse_op;
> +  /* We have separate costs for SImode and DImode, use SImode costs
> +     for smaller modes.  */
> +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
>  }
>  
>  /* Compute a gain for chain conversion.  */
>  
>  int
> -dimode_scalar_chain::compute_convert_gain ()
> +general_scalar_chain::compute_convert_gain ()
>  {
>    bitmap_iterator bi;
>    unsigned insn_uid;
> @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>  
> +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> +     int costs factor in the number of GPRs involved.  When supporting
> +     smaller modes than SImode the int load/store costs need to be
> +     adjusted as well.  */
> +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
>        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      int igain = 0;
>  
>        if (REG_P (src) && REG_P (dst))
> -     gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> +     igain += 2 * m - ix86_cost->xmm_move;
>        else if (REG_P (src) && MEM_P (dst))
> -     gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +     igain
> +       += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
>        else if (MEM_P (src) && REG_P (dst))
> -     gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> +     igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
>        else if (GET_CODE (src) == ASHIFT
>              || GET_CODE (src) == ASHIFTRT
>              || GET_CODE (src) == LSHIFTRT)
>       {
>         if (CONST_INT_P (XEXP (src, 0)))
> -         gain -= vector_const_cost (XEXP (src, 0));
> -       gain += ix86_cost->shift_const;
> +         igain -= vector_const_cost (XEXP (src, 0));
> +       igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
>         if (INTVAL (XEXP (src, 1)) >= 32)
> -         gain -= COSTS_N_INSNS (1);
> +         igain -= COSTS_N_INSNS (1);
>       }
>        else if (GET_CODE (src) == PLUS
>              || GET_CODE (src) == MINUS
> @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
>              || GET_CODE (src) == XOR
>              || GET_CODE (src) == AND)
>       {
> -       gain += ix86_cost->add;
> +       igain += m * ix86_cost->add - ix86_cost->sse_op;
>         /* Additional gain for andnot for targets without BMI.  */
>         if (GET_CODE (XEXP (src, 0)) == NOT
>             && !TARGET_BMI)
> -         gain += 2 * ix86_cost->add;
> +         igain += m * ix86_cost->add;
>  
>         if (CONST_INT_P (XEXP (src, 0)))
> -         gain -= vector_const_cost (XEXP (src, 0));
> +         igain -= vector_const_cost (XEXP (src, 0));
>         if (CONST_INT_P (XEXP (src, 1)))
> -         gain -= vector_const_cost (XEXP (src, 1));
> +         igain -= vector_const_cost (XEXP (src, 1));
>       }
>        else if (GET_CODE (src) == NEG
>              || GET_CODE (src) == NOT)
> -     gain += ix86_cost->add - COSTS_N_INSNS (1);
> +     igain += m * ix86_cost->add - ix86_cost->sse_op;
> +      else if (GET_CODE (src) == SMAX
> +            || GET_CODE (src) == SMIN
> +            || GET_CODE (src) == UMAX
> +            || GET_CODE (src) == UMIN)
> +     {
> +       /* We do not have any conditional move cost, estimate it as a
> +          reg-reg move.  Comparisons are costed as adds.  */
> +       igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> +       /* Integer SSE ops are all costed the same.  */
> +       igain -= ix86_cost->sse_op;
> +     }
>        else if (GET_CODE (src) == COMPARE)
>       {
>         /* Assume comparison cost is the same.  */
> @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
>        else if (CONST_INT_P (src))
>       {
>         if (REG_P (dst))
> -         gain += COSTS_N_INSNS (2);
> +         /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> +         igain += COSTS_N_INSNS (m);
>         else if (MEM_P (dst))
> -         gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> -       gain -= vector_const_cost (src);
> +         igain += (m * ix86_cost->int_store[2]
> +                  - ix86_cost->sse_store[sse_cost_idx]);
> +       igain -= vector_const_cost (src);
>       }
>        else
>       gcc_unreachable ();
> +
> +      if (igain != 0 && dump_file)
> +     {
> +       fprintf (dump_file, "  Instruction gain %d for ", igain);
> +       dump_insn_slim (dump_file, insn);
> +     }
> +      gain += igain;
>      }
>  
>    if (dump_file)
>      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
>  
> +  /* ???  What about integer to SSE?  */
>    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
>      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>  
> @@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai
>  /* Replace REG in X with a V2DI subreg of NEW_REG.  */
>  
>  rtx
> -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
>  {
>    if (x == reg)
> -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> +    return (GET_MODE (new_reg) == vmode
> +         ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0));
>  
>    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
>    int i, j;
> @@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg
>  /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
>  
>  void
> -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
>                                                 rtx reg, rtx new_reg)
>  {
>    replace_with_subreg (single_set (insn), reg, new_reg);
> @@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx
>     and replace its uses in a chain.  */
>  
>  void
> -dimode_scalar_chain::make_vector_copies (unsigned regno)
> +general_scalar_chain::make_vector_copies (unsigned regno)
>  {
>    rtx reg = regno_reg_rtx[regno];
> -  rtx vreg = gen_reg_rtx (DImode);
> +  rtx vreg = gen_reg_rtx (vmode);
>    df_ref ref;
>  
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> @@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies
>       start_sequence ();
>       if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
>         {
> -         rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> -         emit_move_insn (adjust_address (tmp, SImode, 0),
> -                         gen_rtx_SUBREG (SImode, reg, 0));
> -         emit_move_insn (adjust_address (tmp, SImode, 4),
> -                         gen_rtx_SUBREG (SImode, reg, 4));
> -         emit_move_insn (vreg, tmp);
> +         rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> +         if (smode == DImode && !TARGET_64BIT)
> +           {
> +             emit_move_insn (adjust_address (tmp, SImode, 0),
> +                             gen_rtx_SUBREG (SImode, reg, 0));
> +             emit_move_insn (adjust_address (tmp, SImode, 4),
> +                             gen_rtx_SUBREG (SImode, reg, 4));
> +           }
> +         else
> +           emit_move_insn (tmp, reg);
> +         emit_move_insn (vreg,
> +                         gen_rtx_VEC_MERGE (vmode,
> +                                            gen_rtx_VEC_DUPLICATE (vmode,
> +                                                                   tmp),
> +                                            CONST0_RTX (vmode),
> +                                            GEN_INT (HOST_WIDE_INT_1U)));
> +
>         }
> -     else if (TARGET_SSE4_1)
> +     else if (!TARGET_64BIT && smode == DImode)
>         {
> -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                     CONST0_RTX (V4SImode),
> -                                     gen_rtx_SUBREG (SImode, reg, 0)));
> -         emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       gen_rtx_SUBREG (SImode, reg, 4),
> -                                       GEN_INT (2)));
> +         if (TARGET_SSE4_1)
> +           {
> +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                         CONST0_RTX (V4SImode),
> +                                         gen_rtx_SUBREG (SImode, reg, 0)));
> +             emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           gen_rtx_SUBREG (SImode, reg, 4),
> +                                           GEN_INT (2)));
> +           }
> +         else
> +           {
> +             rtx tmp = gen_reg_rtx (DImode);
> +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                         CONST0_RTX (V4SImode),
> +                                         gen_rtx_SUBREG (SImode, reg, 0)));
> +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> +                                         CONST0_RTX (V4SImode),
> +                                         gen_rtx_SUBREG (SImode, reg, 4)));
> +             emit_insn (gen_vec_interleave_lowv4si
> +                        (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                         gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                         gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +           }
>         }
>       else
>         {
> -         rtx tmp = gen_reg_rtx (DImode);
> -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                     CONST0_RTX (V4SImode),
> -                                     gen_rtx_SUBREG (SImode, reg, 0)));
> -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> -                                     CONST0_RTX (V4SImode),
> -                                     gen_rtx_SUBREG (SImode, reg, 4)));
> -         emit_insn (gen_vec_interleave_lowv4si
> -                    (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                     gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                     gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +         emit_move_insn (vreg,
> +                         gen_rtx_VEC_MERGE (vmode,
> +                                            gen_rtx_VEC_DUPLICATE (vmode,
> +                                                                   reg),
> +                                            CONST0_RTX (vmode),
> +                                            GEN_INT (HOST_WIDE_INT_1U)));
>         }
>       rtx_insn *seq = get_insns ();
>       end_sequence ();
> @@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies
>     in case register is used in not convertible insn.  */
>  
>  void
> -dimode_scalar_chain::convert_reg (unsigned regno)
> +general_scalar_chain::convert_reg (unsigned regno)
>  {
>    bool scalar_copy = bitmap_bit_p (defs_conv, regno);
>    rtx reg = regno_reg_rtx[regno];
> @@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign
>    bitmap_copy (conv, insns);
>  
>    if (scalar_copy)
> -    scopy = gen_reg_rtx (DImode);
> +    scopy = gen_reg_rtx (smode);
>  
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
>      {
> @@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign
>         start_sequence ();
>         if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
>           {
> -           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> +           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
>             emit_move_insn (tmp, reg);
> -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                           adjust_address (tmp, SImode, 0));
> -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                           adjust_address (tmp, SImode, 4));
> +           if (!TARGET_64BIT && smode == DImode)
> +             {
> +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                               adjust_address (tmp, SImode, 0));
> +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                               adjust_address (tmp, SImode, 4));
> +             }
> +           else
> +             emit_move_insn (scopy, tmp);
>           }
> -       else if (TARGET_SSE4_1)
> +       else if (!TARGET_64BIT && smode == DImode)
>           {
> -           rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> -           emit_insn
> -             (gen_rtx_SET
> -              (gen_rtx_SUBREG (SImode, scopy, 0),
> -               gen_rtx_VEC_SELECT (SImode,
> -                                   gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> -           tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> -           emit_insn
> -             (gen_rtx_SET
> -              (gen_rtx_SUBREG (SImode, scopy, 4),
> -               gen_rtx_VEC_SELECT (SImode,
> -                                   gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> +           if (TARGET_SSE4_1)
> +             {
> +               rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> +                                           gen_rtvec (1, const0_rtx));
> +               emit_insn
> +                 (gen_rtx_SET
> +                    (gen_rtx_SUBREG (SImode, scopy, 0),
> +                     gen_rtx_VEC_SELECT (SImode,
> +                                         gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                         tmp)));
> +
> +               tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> +               emit_insn
> +                 (gen_rtx_SET
> +                    (gen_rtx_SUBREG (SImode, scopy, 4),
> +                     gen_rtx_VEC_SELECT (SImode,
> +                                         gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                         tmp)));
> +             }
> +           else
> +             {
> +               rtx vcopy = gen_reg_rtx (V2DImode);
> +               emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                               gen_rtx_SUBREG (SImode, vcopy, 0));
> +               emit_move_insn (vcopy,
> +                               gen_rtx_LSHIFTRT (V2DImode,
> +                                                 vcopy, GEN_INT (32)));
> +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                               gen_rtx_SUBREG (SImode, vcopy, 0));
> +             }
>           }
>         else
> -         {
> -           rtx vcopy = gen_reg_rtx (V2DImode);
> -           emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                           gen_rtx_SUBREG (SImode, vcopy, 0));
> -           emit_move_insn (vcopy,
> -                           gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                           gen_rtx_SUBREG (SImode, vcopy, 0));
> -         }
> +         emit_move_insn (scopy, reg);
> +
>         rtx_insn *seq = get_insns ();
>         end_sequence ();
>         emit_conversion_insns (seq, insn);
> @@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign
>     registers conversion.  */
>  
>  void
> -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
>  {
>    *op = copy_rtx_if_shared (*op);
>  
>    if (GET_CODE (*op) == NOT)
>      {
>        convert_op (&XEXP (*op, 0), insn);
> -      PUT_MODE (*op, V2DImode);
> +      PUT_MODE (*op, vmode);
>      }
>    else if (MEM_P (*op))
>      {
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
>  
>        emit_insn_before (gen_move_insn (tmp, *op), insn);
> -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
>  
>        if (dump_file)
>       fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> @@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op
>           gcc_assert (!DF_REF_CHAIN (ref));
>           break;
>         }
> -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> +      if (GET_MODE (*op) != vmode)
> +     *op = gen_rtx_SUBREG (vmode, *op, 0);
>      }
>    else if (CONST_INT_P (*op))
>      {
>        rtx vec_cst;
> -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>  
>        /* Prefer all ones vector in case of -1.  */
>        if (constm1_operand (*op, GET_MODE (*op)))
> -     vec_cst = CONSTM1_RTX (V2DImode);
> +     vec_cst = CONSTM1_RTX (vmode);
>        else
> -     vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> -                                     gen_rtvec (2, *op, const0_rtx));
> +     {
> +       unsigned n = GET_MODE_NUNITS (vmode);
> +       rtx *v = XALLOCAVEC (rtx, n);
> +       v[0] = *op;
> +       for (unsigned i = 1; i < n; ++i)
> +         v[i] = const0_rtx;
> +       vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> +     }
>  
> -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> +      if (!standard_sse_constant_p (vec_cst, vmode))
>       {
>         start_sequence ();
> -       vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> +       vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
>         rtx_insn *seq = get_insns ();
>         end_sequence ();
>         emit_insn_before (seq, insn);
> @@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op
>    else
>      {
>        gcc_assert (SUBREG_P (*op));
> -      gcc_assert (GET_MODE (*op) == V2DImode);
> +      gcc_assert (GET_MODE (*op) == vmode);
>      }
>  }
>  
>  /* Convert INSN to vector mode.  */
>  
>  void
> -dimode_scalar_chain::convert_insn (rtx_insn *insn)
> +general_scalar_chain::convert_insn (rtx_insn *insn)
>  {
>    rtx def_set = single_set (insn);
>    rtx src = SET_SRC (def_set);
> @@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>      {
>        /* There are no scalar integer instructions and therefore
>        temporary register usage is required.  */
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
>        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
>      }
>  
>    switch (GET_CODE (src))
> @@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case ASHIFTRT:
>      case LSHIFTRT:
>        convert_op (&XEXP (src, 0), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>  
>      case PLUS:
> @@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case IOR:
>      case XOR:
>      case AND:
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
>        convert_op (&XEXP (src, 0), insn);
>        convert_op (&XEXP (src, 1), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>  
>      case NEG:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> +      src = gen_rtx_MINUS (vmode, subreg, src);
>        break;
>  
>      case NOT:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> -      src = gen_rtx_XOR (V2DImode, src, subreg);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> +      src = gen_rtx_XOR (vmode, src, subreg);
>        break;
>  
>      case MEM:
> @@ -939,17 +1027,17 @@ dimode_scalar_chain::convert_insn (rtx_i
>        break;
>  
>      case SUBREG:
> -      gcc_assert (GET_MODE (src) == V2DImode);
> +      gcc_assert (GET_MODE (src) == vmode);
>        break;
>  
>      case COMPARE:
>        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>  
> -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> -               || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> +               || (SUBREG_P (src) && GET_MODE (src) == vmode));
>  
>        if (REG_P (src))
> -     subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> +     subreg = gen_rtx_SUBREG (vmode, src, 0);
>        else
>       subreg = copy_rtx_if_shared (src);
>        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -977,7 +1065,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>    PATTERN (insn) = def_set;
>  
>    INSN_CODE (insn) = -1;
> -  recog_memoized (insn);
> +  int patt = recog_memoized (insn);
> +  if (patt == -1)
> +    fatal_insn_not_found (insn);
>    df_insn_rescan (insn);
>  }
>  
> @@ -1116,7 +1206,7 @@ timode_scalar_chain::convert_insn (rtx_i
>  }
>  
>  void
> -dimode_scalar_chain::convert_registers ()
> +general_scalar_chain::convert_registers ()
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1186,7 +1276,7 @@ has_non_address_hard_reg (rtx_insn *insn
>                    (const_int 0 [0])))  */
>  
>  static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    if (!TARGET_SSE4_1)
>      return false;
> @@ -1219,12 +1309,12 @@ convertible_comparison_p (rtx_insn *insn
>  
>    if (!SUBREG_P (op1)
>        || !SUBREG_P (op2)
> -      || GET_MODE (op1) != SImode
> -      || GET_MODE (op2) != SImode
> +      || GET_MODE (op1) != mode
> +      || GET_MODE (op2) != mode
>        || ((SUBREG_BYTE (op1) != 0
> -        || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> +        || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
>         && (SUBREG_BYTE (op2) != 0
> -           || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> +           || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
>      return false;
>  
>    op1 = SUBREG_REG (op1);
> @@ -1232,7 +1322,7 @@ convertible_comparison_p (rtx_insn *insn
>  
>    if (op1 != op2
>        || !REG_P (op1)
> -      || GET_MODE (op1) != DImode)
> +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
>      return false;
>  
>    return true;
> @@ -1241,7 +1331,7 @@ convertible_comparison_p (rtx_insn *insn
>  /* The DImode version of scalar_to_vector_candidate_p.  */
>  
>  static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    rtx def_set = single_set (insn);
>  
> @@ -1255,12 +1345,12 @@ dimode_scalar_to_vector_candidate_p (rtx
>    rtx dst = SET_DEST (def_set);
>  
>    if (GET_CODE (src) == COMPARE)
> -    return convertible_comparison_p (insn);
> +    return convertible_comparison_p (insn, mode);
>  
>    /* We are interested in DImode promotion only.  */
> -  if ((GET_MODE (src) != DImode
> +  if ((GET_MODE (src) != mode
>         && !CONST_INT_P (src))
> -      || GET_MODE (dst) != DImode)
> +      || GET_MODE (dst) != mode)
>      return false;
>  
>    if (!REG_P (dst) && !MEM_P (dst))
> @@ -1280,6 +1370,15 @@ dimode_scalar_to_vector_candidate_p (rtx
>       return false;
>        break;
>  
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
> +      if ((mode == DImode && !TARGET_AVX512VL)
> +       || (mode == SImode && !TARGET_SSE4_1))
> +     return false;
> +      /* Fallthru.  */
> +
>      case PLUS:
>      case MINUS:
>      case IOR:
> @@ -1290,7 +1389,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>         && !CONST_INT_P (XEXP (src, 1)))
>       return false;
>  
> -      if (GET_MODE (XEXP (src, 1)) != DImode
> +      if (GET_MODE (XEXP (src, 1)) != mode
>         && !CONST_INT_P (XEXP (src, 1)))
>       return false;
>        break;
> @@ -1319,7 +1418,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>         || !REG_P (XEXP (XEXP (src, 0), 0))))
>        return false;
>  
> -  if (GET_MODE (XEXP (src, 0)) != DImode
> +  if (GET_MODE (XEXP (src, 0)) != mode
>        && !CONST_INT_P (XEXP (src, 0)))
>      return false;
>  
> @@ -1383,22 +1482,16 @@ timode_scalar_to_vector_candidate_p (rtx
>    return false;
>  }
>  
> -/* Return 1 if INSN may be converted into vector
> -   instruction.  */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> -  if (TARGET_64BIT)
> -    return timode_scalar_to_vector_candidate_p (insn);
> -  else
> -    return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs scans all instruction and
> +   remove insn from CANDIDATES in case it has both convertible
> +   and not convertible definitions.
>  
> -/* The DImode version of remove_non_convertible_regs.  */
> +   All insns in a bitmap are conversion candidates according to
> +   scalar_to_vector_candidate_p.  Currently it implies all insns
> +   are single_set.  */
>  
>  static void
> -dimode_remove_non_convertible_regs (bitmap candidates)
> +general_remove_non_convertible_regs (bitmap candidates)
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1553,23 +1646,6 @@ timode_remove_non_convertible_regs (bitm
>    BITMAP_FREE (regs);
>  }
>  
> -/* For a given bitmap of insn UIDs scans all instruction and
> -   remove insn from CANDIDATES in case it has both convertible
> -   and not convertible definitions.
> -
> -   All insns in a bitmap are conversion candidates according to
> -   scalar_to_vector_candidate_p.  Currently it implies all insns
> -   are single_set.  */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> -  if (TARGET_64BIT)
> -    timode_remove_non_convertible_regs (candidates);
> -  else
> -    dimode_remove_non_convertible_regs (candidates);
> -}
> -
>  /* Main STV pass function.  Find and convert scalar
>     instructions into vector mode when profitable.  */
>  
> @@ -1577,11 +1653,14 @@ static unsigned int
>  convert_scalars_to_vector ()
>  {
>    basic_block bb;
> -  bitmap candidates;
>    int converted_insns = 0;
>  
>    bitmap_obstack_initialize (NULL);
> -  candidates = BITMAP_ALLOC (NULL);
> +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> +  for (unsigned i = 0; i < 3; ++i)
> +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>  
>    calculate_dominance_info (CDI_DOMINATORS);
>    df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1597,51 +1676,73 @@ convert_scalars_to_vector ()
>      {
>        rtx_insn *insn;
>        FOR_BB_INSNS (bb, insn)
> -     if (scalar_to_vector_candidate_p (insn))
> +     if (TARGET_64BIT
> +         && timode_scalar_to_vector_candidate_p (insn))
>         {
>           if (dump_file)
> -           fprintf (dump_file, "  insn %d is marked as a candidate\n",
> +           fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
>                      INSN_UID (insn));
>  
> -         bitmap_set_bit (candidates, INSN_UID (insn));
> +         bitmap_set_bit (&candidates[2], INSN_UID (insn));
> +       }
> +     else
> +       {
> +         /* Check {SI,DI}mode.  */
> +         for (unsigned i = 0; i <= 1; ++i)
> +           if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> +             {
> +               if (dump_file)
> +                 fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> +                          INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> +               bitmap_set_bit (&candidates[i], INSN_UID (insn));
> +               break;
> +             }
>         }
>      }
>  
> -  remove_non_convertible_regs (candidates);
> +  if (TARGET_64BIT)
> +    timode_remove_non_convertible_regs (&candidates[2]);
> +  for (unsigned i = 0; i <= 1; ++i)
> +    general_remove_non_convertible_regs (&candidates[i]);
>  
> -  if (bitmap_empty_p (candidates))
> -    if (dump_file)
> +  for (unsigned i = 0; i <= 2; ++i)
> +    if (!bitmap_empty_p (&candidates[i]))
> +      break;
> +    else if (i == 2 && dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>  
> -  while (!bitmap_empty_p (candidates))
> -    {
> -      unsigned uid = bitmap_first_set_bit (candidates);
> -      scalar_chain *chain;
> +  for (unsigned i = 0; i <= 2; ++i)
> +    while (!bitmap_empty_p (&candidates[i]))
> +      {
> +     unsigned uid = bitmap_first_set_bit (&candidates[i]);
> +     scalar_chain *chain;
>  
> -      if (TARGET_64BIT)
> -     chain = new timode_scalar_chain;
> -      else
> -     chain = new dimode_scalar_chain;
> +     if (cand_mode[i] == TImode)
> +       chain = new timode_scalar_chain;
> +     else
> +       chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
>  
> -      /* Find instructions chain we want to convert to vector mode.
> -      Check all uses and definitions to estimate all required
> -      conversions.  */
> -      chain->build (candidates, uid);
> +     /* Find instructions chain we want to convert to vector mode.
> +        Check all uses and definitions to estimate all required
> +        conversions.  */
> +     chain->build (&candidates[i], uid);
>  
> -      if (chain->compute_convert_gain () > 0)
> -     converted_insns += chain->convert ();
> -      else
> -     if (dump_file)
> -       fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> -                chain->chain_id);
> +     if (chain->compute_convert_gain () > 0)
> +       converted_insns += chain->convert ();
> +     else
> +       if (dump_file)
> +         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> +                  chain->chain_id);
>  
> -      delete chain;
> -    }
> +     delete chain;
> +      }
>  
>    if (dump_file)
>      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>  
> -  BITMAP_FREE (candidates);
> +  for (unsigned i = 0; i <= 2; ++i)
> +    bitmap_release (&candidates[i]);
>    bitmap_obstack_release (NULL);
>    df_process_deferred_rescans ();
>  
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h   (revision 274111)
> +++ gcc/config/i386/i386-features.h   (working copy)
> @@ -127,11 +127,16 @@ namespace {
>  class scalar_chain
>  {
>   public:
> -  scalar_chain ();
> +  scalar_chain (enum machine_mode, enum machine_mode);
>    virtual ~scalar_chain ();
>  
>    static unsigned max_id;
>  
> +  /* Scalar mode.  */
> +  enum machine_mode smode;
> +  /* Vector mode.  */
> +  enum machine_mode vmode;
> +
>    /* ID of a chain.  */
>    unsigned int chain_id;
>    /* A queue of instructions to be included into a chain.  */
> @@ -159,9 +164,11 @@ class scalar_chain
>    virtual void convert_registers () = 0;
>  };
>  
> -class dimode_scalar_chain : public scalar_chain
> +class general_scalar_chain : public scalar_chain
>  {
>   public:
> +  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> +    : scalar_chain (smode_, vmode_) {}
>    int compute_convert_gain ();
>   private:
>    void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
>  class timode_scalar_chain : public scalar_chain
>  {
>   public:
> +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
>    /* Convert from TImode to V1TImode is always faster.  */
>    int compute_convert_gain () { return 1; }
>  
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md   (revision 274111)
> +++ gcc/config/i386/i386.md   (working copy)
> @@ -17729,6 +17729,110 @@ (define_expand "add<mode>cc"
>     (match_operand:SWI 3 "const_int_operand")]
>    ""
>    "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
> +
> +;; min/max patterns
> +
> +(define_mode_iterator MAXMIN_IMODE
> +  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
> +(define_code_attr maxmin_rel
> +  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
> +
> +(define_expand "<code><mode>3"
> +  [(parallel
> +    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
> +       (maxmin:MAXMIN_IMODE
> +         (match_operand:MAXMIN_IMODE 1 "register_operand")
> +         (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
> +     (clobber (reg:CC FLAGS_REG))])]
> +  "TARGET_STV")
> +
> +(define_insn_and_split "*<code><mode>3_1"
> +  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
> +     (maxmin:MAXMIN_IMODE
> +       (match_operand:MAXMIN_IMODE 1 "register_operand")
> +       (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +     (if_then_else:MAXMIN_IMODE (match_dup 3)
> +       (match_dup 1)
> +       (match_dup 2)))]
> +{
> +  machine_mode mode = <MODE>mode;
> +
> +  if (!register_operand (operands[2], mode))
> +    operands[2] = force_reg (mode, operands[2]);
> +
> +  enum rtx_code code = <maxmin_rel>;
> +  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
> +  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
> +
> +  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
> +  emit_insn (gen_rtx_SET (flags, tmp));
> +
> +  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
> +})
> +
> +(define_insn_and_split "*<code>di3_doubleword"
> +  [(set (match_operand:DI 0 "register_operand")
> +     (maxmin:DI (match_operand:DI 1 "register_operand")
> +                (match_operand:DI 2 "nonimmediate_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +     (if_then_else:SI (match_dup 6)
> +       (match_dup 1)
> +       (match_dup 2)))
> +   (set (match_dup 3)
> +     (if_then_else:SI (match_dup 6)
> +       (match_dup 4)
> +       (match_dup 5)))]
> +{
> +  if (!register_operand (operands[2], DImode))
> +    operands[2] = force_reg (DImode, operands[2]);
> +
> +  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
> +
> +  rtx cmplo[2] = { operands[1], operands[2] };
> +  rtx cmphi[2] = { operands[4], operands[5] };
> +
> +  enum rtx_code code = <maxmin_rel>;
> +
> +  switch (code)
> +    {
> +    case LE: case LEU:
> +      std::swap (cmplo[0], cmplo[1]);
> +      std::swap (cmphi[0], cmphi[1]);
> +      code = swap_condition (code);
> +      /* FALLTHRU */
> +
> +    case GE: case GEU:
> +      {
> +     bool uns = (code == GEU);
> +     rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
> +       = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
> +
> +     emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
> +
> +     rtx tmp = gen_rtx_SCRATCH (SImode);
> +     emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
> +
> +     rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
> +     operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
> +
> +     break;
> +      }
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +})
>  
>  ;; Misc patterns (?)
>  
> Index: gcc/testsuite/gcc.target/i386/minmax-3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-3.c  (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-3.c  (working copy)
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv" } */
> +
> +#define max(a,b) (((a) > (b))? (a) : (b))
> +#define min(a,b) (((a) < (b))? (a) : (b))
> +
> +int ssi[1024];
> +unsigned int usi[1024];
> +long long sdi[1024];
> +unsigned long long udi[1024];
> +
> +#define CHECK(FN, VARIANT) \
> +void \
> +FN ## VARIANT (void) \
> +{ \
> +  for (int i = 1; i < 1024; ++i) \
> +    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
> +}
> +
> +CHECK(max, ssi);
> +CHECK(min, ssi);
> +CHECK(max, usi);
> +CHECK(min, usi);
> +CHECK(max, sdi);
> +CHECK(min, sdi);
> +CHECK(max, udi);
> +CHECK(min, udi);
> Index: gcc/testsuite/gcc.target/i386/minmax-4.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-4.c  (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-4.c  (working copy)
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv -msse4.1" } */
> +
> +#include "minmax-3.c"
> +
> +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
> +/* { dg-final { scan-assembler-times "pminsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pminud" 1 } } */
> Index: gcc/testsuite/gcc.target/i386/minmax-6.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-6.c  (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-6.c  (working copy)
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=haswell" } */
> +
> +unsigned short
> +UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> +{
> +  if (y != width)
> +    {
> +      y = y < 0 ? 0 : y;
> +      return Pic[y * width];
> +    }
> +  return Pic[y];
> +} 
> +
> +/* We do not want the RA to spill %esi for its dual-use but using
> +   pmaxsd is OK.  */
> +/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> +/* { dg-final { scan-assembler "pmaxsd" } } */
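
For readers following the *<code>di3_doubleword splitter above: it compares the low SImode halves, then does a subtract-with-borrow on the high halves, and selects on GE or GEU (LE and LEU are handled by swapping the compare operands first). Below is a minimal C sketch of what that sequence computes. It is not part of the patch; all helper names are made up for illustration, and the flag behaviour is modelled with plain integer arithmetic rather than derived from the .md patterns.

```c
#include <assert.h>   /* only needed for the checks mentioned after the block */
#include <stdint.h>

/* Hypothetical helper: unsigned a >= b using only 32-bit halves.
   Models "cmp lo; sbb hi" followed by a carry-clear (GEU) test.  */
static int
geu64_by_halves (uint64_t a, uint64_t b)
{
  uint32_t alo = (uint32_t) a, ahi = (uint32_t) (a >> 32);
  uint32_t blo = (uint32_t) b, bhi = (uint32_t) (b >> 32);

  /* cmp alo, blo: the borrow out of the low-half subtraction.  */
  uint64_t borrow = alo < blo;

  /* sbb ahi, bhi: the final borrow is set iff ahi < bhi + borrow.  */
  int carry = (uint64_t) ahi < (uint64_t) bhi + borrow;

  return !carry;
}

/* Hypothetical helper: high half of V as a signed value, computed
   without relying on implementation-defined shifts or conversions.  */
static int64_t
signed_hi (int64_t v)
{
  uint32_t hi = (uint32_t) ((uint64_t) v >> 32);
  return hi >= 0x80000000u ? (int64_t) hi - INT64_C (4294967296)
			   : (int64_t) hi;
}

/* Hypothetical helper: signed a >= b.  The exact (unwrapped) value of
   hi(a) - hi(b) - borrow is non-negative iff a >= b; this is the
   condition the signed GE test after sbb stands for.  */
static int
ge64_by_halves (int64_t a, int64_t b)
{
  uint32_t alo = (uint32_t) (uint64_t) a;
  uint32_t blo = (uint32_t) (uint64_t) b;
  int64_t borrow = alo < blo;

  return signed_hi (a) - signed_hi (b) - borrow >= 0;
}

/* max keeps operand 1 when a >= b.  */
static int64_t
smax64_by_halves (int64_t a, int64_t b)
{
  return ge64_by_halves (a, b) ? a : b;
}

/* min swaps the compare operands (LE becomes GE) but not the arms.  */
static int64_t
smin64_by_halves (int64_t a, int64_t b)
{
  return ge64_by_halves (b, a) ? a : b;
}

static uint64_t
umax64_by_halves (uint64_t a, uint64_t b)
{
  return geu64_by_halves (a, b) ? a : b;
}
```

As a sanity check, assert (smax64_by_halves (-1, 0) == 0) and assert (smin64_by_halves (-5, 3) == -5) should both hold. In the sketch only the compare is done on halves; the two selects on the low and high parts are folded into one ?: for brevity.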

-- 
Richard Biener <rguent...@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
