Hi!

On Fri, Sep 18, 2020 at 01:17:41AM -0500, Xiong Hu Luo wrote:
> vec_insert accepts 3 arguments, arg0 is input vector, arg1 is the value
> to be insert, arg2 is the place to insert arg1 to arg0.  Current expander
> generates stxv+stwx+lxv if arg2 is variable instead of constant, which
> causes serious store hit load performance issue on Power.  This patch tries
>  1) Build VIEW_CONVERT_EXPR for vec_insert (i, v, n) like v[n&3] = i to
> unify the gimple code, then expander could use vec_set_optab to expand.
>  2) Expand the IFN VEC_SET to fast instructions: lvsl+xxperm+xxsel.
> In this way, "vec_insert (i, v, n)" and "v[n&3] = i" won't be expanded too
> early in gimple stage if arg2 is variable, avoid generating store hit load
> instructions.

Sounds good.

> For Power9 V4SI:
>       addi 9,1,-16
>       rldic 6,6,2,60
>       stxv 34,-16(1)
>       stwx 5,9,6
>       lxv 34,-16(1)
> =>
>       addis 9,2,.LC0@toc@ha
>       addi 9,9,.LC0@toc@l
>       mtvsrwz 33,5
>       lxv 32,0(9)
>       sradi 9,6,2
>       addze 9,9

These last two insns are part of a signed modulo (sradi/addze is the
signed divide-by-four idiom; the sldi/subf below compute the remainder
from that quotient).  That seems wrong: the index should wrap unsigned.

>       sldi 9,9,2
>       subf 9,9,6
>       subfic 9,9,3
>       sldi 9,9,2
>       subfic 9,9,20

This should be optimised quite a bit.  Why isn't it?  Is this -O0
output?  That isn't interesting, please show -O2 instead.

>       lvsl 13,0,9
>       xxperm 33,33,45
>       xxperm 32,32,45
>       xxsel 34,34,33,32

It feels like this can be done in fewer insns.  Hrm.  Since you already
rotate both the data and the mask, does it work better if you rotate
to some known position and then use vinsertw (and then rotate again)?
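
Something like this, maybe (completely untested, and I do not remember
offhand which source byte vinsertw takes its word from, so treat it as
a sketch only):

  lvsl 13,0,9        # permute control from the index
  vperm 2,2,2,13     # rotate the element to a fixed byte offset
  vinsertw 2,1,4     # insert the new word at that fixed offset
  lvsr 13,0,9
  vperm 2,2,2,13     # and rotate back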

> Though instructions increase from 5 to 15, the performance is improved
> 60% in typical cases.

Yeah, great :-)

> Tested with V2DI, V2DF V4SI, V4SF, V8HI, V16QI

That is what the new testcases do, right?

>       * config/rs6000/altivec.md (altivec_lvsl_reg_<mode>2): Rename to
>        (altivec_lvsl_reg_<mode>2) and extend to SDI mode.

(wrong leading space)

You say the same thing is renamed to the same thing (and not how we
normally do that).

"extend to SDI mode"...  Please rephrase, "extend" usually means sign-
or zero-extension in RTL.  But, see below.

>       * config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
>       Ajdust variable index vec_insert to VIEW_CONVERT_EXPR.

"Adjust", but this changelog entry does not tell me what the change is.

>       * config/rs6000/rs6000-protos.h (rs6000_expand_vector_set_var):
>       New declare.

(declaration)

>       * config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.

From some specific define_something, surely?

> -(define_insn "altivec_lvsl_reg"
> +(define_insn "altivec_lvsl_reg_<mode>2"
>    [(set (match_operand:V16QI 0 "altivec_register_operand" "=v")
>       (unspec:V16QI
> -     [(match_operand:DI 1 "gpc_reg_operand" "b")]
> +     [(match_operand:SDI 1 "gpc_reg_operand" "b")]
>       UNSPEC_LVSL_REG))]
>    "TARGET_ALTIVEC"
>    "lvsl %0,0,%1"

"DI" is wrong, yes; but so is SDI.  It should just be P here, afaics?
SI on -m32, DI on -m64.
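
That is, just

  [(match_operand:P 1 "gpc_reg_operand" "b")]

here.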

> +/* Insert value from VEC into idx of TARGET.  */

What does idx count?  Bytes?  Things the same size as "value"?

What does "value" mean, anyway?  Do you mean "VAL" (we use upercase for
all parameter names; "IDX" as well).

> +void
> +rs6000_expand_vector_set_var (rtx target, rtx vec, rtx val, rtx idx)
> +{
> +  machine_mode mode = GET_MODE (vec);
> +
> +  if (VECTOR_MEM_VSX_P (mode) && CONST_INT_P (idx))
> +    gcc_unreachable ();
> +  else if (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (idx)
> +        && TARGET_DIRECT_MOVE_64BIT)

You already handled the CONST_INT_P case a line before, so
!CONST_INT_P (idx) is always true here.

This actually does *nothing* if this if() isn't true, which cannot be
right.

Maybe you want to replace these four lines with just
  gcc_assert (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (idx));

and handle !TARGET_DIRECT_MOVE_64BIT in some way as well (maybe this
cannot be called then; if so, assert that too, or at least document
that it is required).
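
So the top of the function could look like (just a sketch):

  void
  rs6000_expand_vector_set_var (rtx target, rtx vec, rtx val, rtx idx)
  {
    machine_mode mode = GET_MODE (vec);

    /* Constant idx is handled in rs6000_expand_vector_set, and the
       sequence below needs direct moves.  */
    gcc_assert (VECTOR_MEM_VSX_P (mode) && !CONST_INT_P (idx));
    gcc_assert (TARGET_DIRECT_MOVE_64BIT);
    ...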

> +       if (!BYTES_BIG_ENDIAN)
> +         {
> +           /*  idx = 1 - idx.  */
> +           emit_insn (gen_subsi3 (tmp, GEN_INT (1), idx));
> +           /*  idx = idx * 8.  */
> +           emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (3)));
> +           /*  idx = 16 - idx.  */
> +           emit_insn (gen_subsi3 (tmp, GEN_INT (16), tmp));
> +         }

Maybe combine this here already?  16 - 8*(1 - idx) is the same as
8 + 8*idx.  Which might not be correct at all?
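
If it is, that is just (untested):

  /* tmp = idx * 8 + 8.  */
  emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (3)));
  emit_insn (gen_addsi3 (tmp, tmp, GEN_INT (8)));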

(GEN_INT (1) is const1_rtx, btw.)

> +      else if (GET_MODE_SIZE (inner_mode) == 4)
> +     {
> +       if (!BYTES_BIG_ENDIAN)
> +         {
> +           /*  idx = 3 - idx.  */
> +           emit_insn (gen_subsi3 (tmp, GEN_INT (3), idx));
> +           /*  idx = idx * 4.  */
> +           emit_insn (gen_ashlsi3 (tmp, tmp, GEN_INT (2)));
> +           /*  idx = 20 - idx.  */
> +           emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
> +         }
> +       else
> +       {
> +           emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (2)));
> +           emit_insn (gen_subsi3 (tmp, GEN_INT (20), tmp));
> +       }
> +     }

Please have all the different sizes in just one block, and compute the
constants you need there.
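
Something like this, perhaps (a sketch, using the simplification from
above, so double-check the arithmetic):

  int size = GET_MODE_SIZE (inner_mode);
  /* tmp = idx * size.  */
  emit_insn (gen_ashlsi3 (tmp, idx, GEN_INT (exact_log2 (size))));
  if (!BYTES_BIG_ENDIAN)
    /* tmp = idx * size + 8.  */
    emit_insn (gen_addsi3 (tmp, tmp, GEN_INT (8)));
  else
    /* tmp = 16 + size - idx * size.  */
    emit_insn (gen_subsi3 (tmp, GEN_INT (16 + size), tmp));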

> +      /*  lxv vs32, mask.
> +       DImode: 0xffffffffffffffff0000000000000000
> +       SImode: 0x00000000ffffffff0000000000000000
> +       HImode: 0x000000000000ffff0000000000000000.
> +       QImode: 0x00000000000000ff0000000000000000.  */
> +      rtx mask = gen_reg_rtx (V16QImode);
> +      rtx mask_v2di = gen_reg_rtx (V2DImode);
> +      rtvec v = rtvec_alloc (2);
> +      if (!BYTES_BIG_ENDIAN)
> +     {
> +       RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, 0);
> +       RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, mode_mask);
> +     }
> +      else
> +      {
> +       RTVEC_ELT (v, 0) = gen_rtx_CONST_INT (DImode, mode_mask);
> +       RTVEC_ELT (v, 1) = gen_rtx_CONST_INT (DImode, 0);
> +     }
> +      emit_insn (
> +     gen_vec_initv2didi (mask_v2di, gen_rtx_PARALLEL (V2DImode, v)));

Please use two (or more) statements if things like this do not fit.  If
it is hard to lay out code, you almost always should factor the code a
bit, and it will be better for it.
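
Like

  rtx par = gen_rtx_PARALLEL (V2DImode, v);
  emit_insn (gen_vec_initv2didi (mask_v2di, par));

for example.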

> +      /*  mtvsrd[wz] f0,val.  */
> +      rtx val_v16qi = gen_reg_rtx (V16QImode);
> +      switch (inner_mode)

So it looks like we want a parameterized name for p8_mtvsrwz, and for
p8_mtvsrd as well.  And then you can do the switch as just an if() on
the size of inner_mode (whether it is <= 4 or not).
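
Something like (a sketch, with the "@" parameterized names you would
add):

  if (GET_MODE_SIZE (inner_mode) <= 4)
    emit_insn (gen_p8_mtvsrwz (inner_mode, val_v16qi, val));
  else
    emit_insn (gen_p8_mtvsrd (inner_mode, val_v16qi, val));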

> +      /*  lvsl    v1,0,idx.  */
> +      rtx pcv = gen_reg_rtx (V16QImode);
> +      emit_insn (gen_altivec_lvsl_reg_si2 (pcv, tmp));
> +
> +      /*  xxperm  vs0,vs0,vs33.  */
> +      /*  xxperm  vs32,vs32,vs33.  */

But you generate vperm, instead!  vperm has one more parameter (vperm
VRT,VRA,VRB,VRC names both sources explicitly, while with xxperm the
target register doubles as one of the sources), and is much easier to
understand because of that.

> +(define_mode_iterator FQHS [SF QI HI SI])
> +(define_mode_iterator FD [DF DI])

Those are pretty bad names.  The concept does not make much sense
either: SF is normally stored in registers as DF, which makes it even
weirder.

Why have floating point modes here at all?

> +(define_insn "p8_mtvsrwz_v16qi<mode>2"
> +  [(set (match_operand:V16QI 0 "register_operand" "=wa")
> +     (unspec:V16QI [(match_operand:FQHS 1 "register_operand" "r")]
> +                UNSPEC_P8V_MTVSRWZ))]
> +  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
> +  "mtvsrwz %x0,%1"
> +  [(set_attr "type" "mftgpr")])

This insn does not require TARGET_POWERPC64; it only looks at the low
32 bits of a GPR, after all.

> +(define_insn "p8_mtvsrd_v16qi<mode>2"
> +  [(set (match_operand:V16QI 0 "register_operand" "=wa")
> +     (unspec:V16QI [(match_operand:FD 1 "register_operand" "r")]
> +                UNSPEC_P8V_MTVSRD))]
> +  "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
> +  "mtvsrd %x0,%1"
> +  [(set_attr "type" "mftgpr")])

All this can be written without unspecs (and it should be, as should
all existing patterns that use them!)  Unspec is the great optimisation
inhibitor.

What do you need new patterns for, anyway?
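
For the 8-byte case, for example, something like the existing
vsx_splat_v2di (which does not use an unspec) might already do the
job -- a sketch, I did not check the conditions:

  rtx tmp_v2di = gen_reg_rtx (V2DImode);
  emit_insn (gen_vsx_splat_v2di (tmp_v2di, val));
  rtx val_v16qi = gen_lowpart (V16QImode, tmp_v2di);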

>  (define_expand "vec_set<mode>"

[...]

> +  if (CONST_INT_P (operands[2]))
> +    {
> +      rs6000_expand_vector_set (operands[0], operands[1], INTVAL (operands[2]));
> +      DONE;
> +    }
> +  else
> +    {
> +      rtx target = gen_reg_rtx (V16QImode);
> +      rs6000_expand_vector_set_var (target, operands[0], operands[1],
> +                                 operands[2]);
> +      rtx sub_target
> +     = simplify_gen_subreg (GET_MODE (operands[0]), target, V16QImode, 0);
> +      emit_insn (gen_rtx_SET (operands[0], sub_target));
> +      DONE;
> +    }

Please handle this in rs6000_expand_vector_set, instead.  It is fine
to call rs6000_expand_vector_set_var early in that, for example.
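
Something like (a sketch, assuming the element number parameter
becomes an rtx there):

  void
  rs6000_expand_vector_set (rtx target, rtx val, rtx elt_rtx)
  {
    if (!CONST_INT_P (elt_rtx))
      {
        rs6000_expand_vector_set_var (target, target, val, elt_rtx);
        return;
      }
    /* ... the existing code, using INTVAL (elt_rtx) ...  */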

> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index dd750210758..7e82690d12d 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -5349,7 +5349,7 @@ (define_expand "xl_len_r"
>    rtx rtx_vtmp = gen_reg_rtx (V16QImode);
>    rtx tmp = gen_reg_rtx (DImode);
>  
> -  emit_insn (gen_altivec_lvsl_reg (shift_mask, operands[2]));
> +  emit_insn (gen_altivec_lvsl_reg_di2 (shift_mask, operands[2]));

So this becomes
  emit_insn (gen_altivec_lvsl_reg (DImode, shift_mask, operands[2]));
if you use parameterized names.
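
(The .md side of that is just

  (define_insn "@altivec_lvsl_reg_<mode>"

and genemit then creates gen_altivec_lvsl_reg taking the mode as its
first argument.)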


Tests...  You don't need lp64 anywhere, but then you probably need
some other way to keep the 64-bit-only tests from running where they
cannot.  That may be hard?


When you resend, please split it into separate patches for separate
things.  Small patches are fine, no, *good* even!


Segher
