On Mon, Jun 23, 2025 at 4:10 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Mon, Jun 23, 2025 at 3:11 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Thu, Jun 19, 2025 at 10:25 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > Extend the remove_redundant_vector pass to handle vector broadcasts from
> > > constant and variable scalars.  When broadcasting from constants and
> > > function arguments, we can place a single widest vector broadcast at
> > > entry of the nearest common dominator for basic blocks with all uses
> > > since constants and function arguments aren't changed.  For broadcast
> > > from variables with a single definition, the single definition is
> > > replaced with the widest broadcast.
> > >
> > > gcc/
> > >
> > >         PR target/92080
> > >         * config/i386/i386-expand.cc (ix86_expand_call): Set
> > >         recursive_function to true for recursive call.
> > >         * config/i386/i386-features.cc (ix86_place_single_vector_set):
> > >         Add an argument for inner scalar, default to nullptr.  Set the
> > >         source from inner scalar if not nullptr.
> > >         (ix86_get_vector_load_mode): Renamed to ...
> > >         (ix86_get_vector_cse_mode): This.  Add an argument for scalar mode
> > >         and handle integer and float scalar modes.
> > >         (replace_vector_const): Add an argument for scalar mode and pass
> > >         it to ix86_get_vector_load_mode.
> > >         (x86_cse_kind): New.
> > >         (redundant_load): Likewise.
> > >         (ix86_broadcast_inner): Likewise.
> > >         (remove_redundant_vector_load): Also support const0_rtx and
> > >         constm1_rtx broadcasts.  Handle vector broadcasts from constant
> > >         and variable scalars.
> > >         * config/i386/i386.h (machine_function): Add recursive_function.
> > >
> > > gcc/testsuite/
> > >
> > >         * gcc.target/i386/keylocker-aesdecwide128kl.c: Updated to expect
> > >         movdqa instead pxor.
> > >         * gcc.target/i386/keylocker-aesdecwide256kl.c: Likewise.
> > >         * gcc.target/i386/keylocker-aesencwide128kl.c: Likewise.
> > >         * gcc.target/i386/keylocker-aesencwide256kl.c: Likewise.
> > >         * gcc.target/i386/pr92080-4.c: New test.
> > >         * gcc.target/i386/pr92080-5.c: Likewise.
> > >         * gcc.target/i386/pr92080-6.c: Likewise.
> > >         * gcc.target/i386/pr92080-7.c: Likewise.
> > >         * gcc.target/i386/pr92080-8.c: Likewise.
> > >         * gcc.target/i386/pr92080-9.c: Likewise.
> > >         * gcc.target/i386/pr92080-10.c: Likewise.
> > >         * gcc.target/i386/pr92080-11.c: Likewise.
> > >         * gcc.target/i386/pr92080-12.c: Likewise.
> > >         * gcc.target/i386/pr92080-13.c: Likewise.
> > >         * gcc.target/i386/pr92080-14.c: Likewise.
> > >         * gcc.target/i386/pr92080-15.c: Likewise.
> > >         * gcc.target/i386/pr92080-16.c: Likewise.
> > >
> > > Signed-off-by: H.J. Lu <hjl.to...@gmail.com>
> > > ---
> > >  gcc/config/i386/i386-expand.cc                |   3 +
> > >  gcc/config/i386/i386-features.cc              | 410 ++++++++++++++----
> > >  gcc/config/i386/i386.h                        |   3 +
> > >  .../i386/keylocker-aesdecwide128kl.c          |  14 +-
> > >  .../i386/keylocker-aesdecwide256kl.c          |  14 +-
> > >  .../i386/keylocker-aesencwide128kl.c          |  14 +-
> > >  .../i386/keylocker-aesencwide256kl.c          |  14 +-
> > >  gcc/testsuite/gcc.target/i386/pr92080-10.c    |  13 +
> > >  gcc/testsuite/gcc.target/i386/pr92080-11.c    |  33 ++
> > >  gcc/testsuite/gcc.target/i386/pr92080-12.c    |  16 +
> > >  gcc/testsuite/gcc.target/i386/pr92080-13.c    |  32 ++
> > >  gcc/testsuite/gcc.target/i386/pr92080-14.c    |  31 ++
> > >  gcc/testsuite/gcc.target/i386/pr92080-15.c    |  25 ++
> > >  gcc/testsuite/gcc.target/i386/pr92080-16.c    |  26 ++
> > >  gcc/testsuite/gcc.target/i386/pr92080-4.c     |  50 +++
> > >  gcc/testsuite/gcc.target/i386/pr92080-5.c     | 109 +++++
> > >  gcc/testsuite/gcc.target/i386/pr92080-6.c     |  19 +
> > >  gcc/testsuite/gcc.target/i386/pr92080-7.c     |  20 +
> > >  gcc/testsuite/gcc.target/i386/pr92080-8.c     |  16 +
> > >  gcc/testsuite/gcc.target/i386/pr92080-9.c     |  81 ++++
> > >  20 files changed, 823 insertions(+), 120 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-10.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-11.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-12.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-13.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-14.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-15.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-16.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-4.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-5.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-6.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-7.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-8.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-9.c
> > >
> > > diff --git a/gcc/config/i386/i386-expand.cc 
> > > b/gcc/config/i386/i386-expand.cc
> > > index 4946f87a131..60c1ac5d2b1 100644
> > > --- a/gcc/config/i386/i386-expand.cc
> > > +++ b/gcc/config/i386/i386-expand.cc
> > > @@ -10154,6 +10154,9 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx 
> > > callarg1,
> > >           else if (lookup_attribute ("no_callee_saved_registers",
> > >                                      TYPE_ATTRIBUTES (TREE_TYPE 
> > > (fndecl))))
> > >             call_no_callee_saved_registers = true;
> > > +         if (fndecl == current_function_decl
> > > +             && decl_binds_to_current_def_p (fndecl))
> > > +           cfun->machine->recursive_function = true;
> > >         }
> > >      }
> > >    else
> > > diff --git a/gcc/config/i386/i386-features.cc 
> > > b/gcc/config/i386/i386-features.cc
> > > index 56ab7f2d23b..2c49d11da8f 100644
> > > --- a/gcc/config/i386/i386-features.cc
> > > +++ b/gcc/config/i386/i386-features.cc
> > > @@ -3088,10 +3088,12 @@ ix86_rpad_gate ()
> > >  /* Generate a vector set, DEST = SRC, at entry of the nearest dominator
> > >     for basic block map BBS, which is in the fake loop that contains the
> > >     whole function, so that there is only a single vector set in the
> > > -   whole function.   */
> > > +   whole function.  If not nullptr, INNER_SCALAR is the inner scalar of
> > > +   SRC, as (reg:SI 99) in (vec_duplicate:V4SI (reg:SI 99)).  */
> > >
> > >  static void
> > > -ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs)
> > > +ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs,
> > > +                             rtx inner_scalar = nullptr)
> > >  {
> > >    basic_block bb = nearest_common_dominator_for_set (CDI_DOMINATORS, 
> > > bbs);
> > >    while (bb->loop_father->latch
> > > @@ -3112,10 +3114,23 @@ ix86_place_single_vector_set (rtx dest, rtx src, 
> > > bitmap bbs)
> > >        insn = NEXT_INSN (insn);
> > >      }
> > >
> > > +  rtx_insn *set_insn;
> > >    if (insn == BB_HEAD (bb))
> > > -    emit_insn_before (set, insn);
> > > +    set_insn = emit_insn_before (set, insn);
> > >    else
> > > -    emit_insn_after (set, insn ? PREV_INSN (insn) : BB_END (bb));
> > > +    set_insn = emit_insn_after (set,
> > > +                               insn ? PREV_INSN (insn) : BB_END (bb));
> > > +
> > > +  if (inner_scalar)
> > > +    {
> > > +      /* Set the source in (vec_duplicate:V4SI (reg:SI 99)).  */
> > > +      rtx reg = XEXP (src, 0);
> > > +      if ((REG_P (inner_scalar) || MEM_P (inner_scalar))
> > > +         && GET_MODE (reg) != GET_MODE (inner_scalar))
> > > +       inner_scalar = gen_rtx_SUBREG (GET_MODE (reg), inner_scalar, 0);
> > > +      rtx set = gen_rtx_SET (reg, inner_scalar);
> > > +      emit_insn_before (set, set_insn);
> > > +    }
> > >  }
> > >
> > >  /* At entry of the nearest common dominator for basic blocks with
> > > @@ -3346,26 +3361,15 @@ make_pass_remove_partial_avx_dependency 
> > > (gcc::context *ctxt)
> > >    return new pass_remove_partial_avx_dependency (ctxt);
> > >  }
> > >
> > > -/* Return a machine mode suitable for vector SIZE.  */
> > > +/* Return a machine mode suitable for vector SIZE with SMODE inner
> > > +   mode.  */
> > >
> > >  static machine_mode
> > > -ix86_get_vector_load_mode (unsigned int size)
> > > +ix86_get_vector_cse_mode (unsigned int size, machine_mode smode)
> > >  {
> > > -  machine_mode mode;
> > > -  if (size == 64)
> > > -    mode = V64QImode;
> > > -  else if (size == 32)
> > > -    mode = V32QImode;
> > > -  else if (size == 16)
> > > -    mode = V16QImode;
> > > -  else if (size == 8)
> > > -    mode = V8QImode;
> > > -  else if (size == 4)
> > > -    mode = V4QImode;
> > > -  else if (size == 2)
> > > -    mode = V2QImode;
> > > -  else
> > > -    gcc_unreachable ();
> > > +  scalar_mode s_mode = as_a <scalar_mode> (smode);
> > > +  poly_uint64 nunits = size / GET_MODE_SIZE (smode);
> > > +  machine_mode mode = mode_for_vector (s_mode, nunits).require ();
> > >    return mode;
> > >  }
> > >
> > > @@ -3374,7 +3378,8 @@ ix86_get_vector_load_mode (unsigned int size)
> > >
> > >  static void
> > >  replace_vector_const (machine_mode vector_mode, rtx vector_const,
> > > -                     auto_bitmap &vector_insns)
> > > +                     auto_bitmap &vector_insns,
> > > +                     machine_mode scalar_mode)
> > >  {
> > >    bitmap_iterator bi;
> > >    unsigned int id;
> > > @@ -3386,7 +3391,8 @@ replace_vector_const (machine_mode vector_mode, rtx 
> > > vector_const,
> > >        /* Get the single SET instruction.  */
> > >        rtx set = single_set (insn);
> > >        rtx src = SET_SRC (set);
> > > -      machine_mode mode = GET_MODE (src);
> > > +      rtx dest = SET_DEST (set);
> > > +      machine_mode mode = GET_MODE (dest);
> > >
> > >        rtx replace;
> > >        /* Replace the source operand with VECTOR_CONST.  */
> > > @@ -3400,7 +3406,8 @@ replace_vector_const (machine_mode vector_mode, rtx 
> > > vector_const,
> > >               /* If the mode size is smaller than its natural size,
> > >                  first insert an extra move with a QI vector SUBREG
> > >                  of the same size to avoid validate_subreg failure.  */
> > > -             machine_mode vmode = ix86_get_vector_load_mode (size);
> > > +             machine_mode vmode
> > > +               = ix86_get_vector_cse_mode (size, scalar_mode);
> > >               rtx vreg;
> > >               if (mode == vmode)
> > >                 vreg = vector_const;
> > > @@ -3426,6 +3433,169 @@ replace_vector_const (machine_mode vector_mode, 
> > > rtx vector_const,
> > >      }
> > >  }
> > >
> > > +enum x86_cse_kind
> > > +{
> > > +  X86_CSE_CONST0_VECTOR,
> > > +  X86_CSE_CONSTM1_VECTOR,
> > > +  X86_CSE_VEC_DUP
> > > +};
> > > +
> > > +struct redundant_load
> > > +{
> > > +  /* Bitmap of basic blocks with broadcast instructions.  */
> > > +  auto_bitmap bbs;
> > > +  /* Bitmap of broadcast instructions.  */
> > > +  auto_bitmap insns;
> > > +  /* The broadcast inner scalar.  */
> > > +  rtx val;
> > > +  /* The inner scalar mode.  */
> > > +  machine_mode mode;
> > > +  /* The instruction which sets the inner scalar.  Nullptr if the inner
> > > +     scalar is applied to the whole function, instead of within the same
> > > +     block.  */
> > > +  rtx_insn *def_insn;
> > > +  /* The widest broadcast source.  */
> > > +  rtx broadcast_source;
> > > +  /* The widest broadcast register.  */
> > > +  rtx broadcast_reg;
> > > +  /* The basic block of the broadcast instruction.  */
> > > +  basic_block bb;
> > > +  /* The number of broadcast instructions with the same inner scalar.  */
> > > +  unsigned HOST_WIDE_INT count;
> > > +  /* The threshold of broadcast instructions with the same inner
> > > +     scalar.  */
> > > +  unsigned int threshold;
> > > +  /* The widest broadcast size in bytes.  */
> > > +  unsigned int size;
> > > +  /* Load kind.  */
> > > +  x86_cse_kind kind;
> > > +};
> > > +
> > > +/* Return the inner scalar if OP is a broadcast, else return nullptr.  */
> > > +
> > > +static rtx
> > > +ix86_broadcast_inner (rtx op, machine_mode mode,
> > > +                     machine_mode *scalar_mode_p,
> > > +                     x86_cse_kind *kind_p, rtx_insn **insn_p)
> > > +{
> > > +  if (op == const0_rtx || op == CONST0_RTX (mode))
> > > +    {
> > > +      *scalar_mode_p = QImode;
> > > +      *kind_p = X86_CSE_CONST0_VECTOR;
> > > +      *insn_p = nullptr;
> > > +      return const0_rtx;
> > > +    }
> > > +  else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> > > +          && (op == constm1_rtx || op == CONSTM1_RTX (mode)))
> > > +    {
> > > +      *scalar_mode_p = QImode;
> > > +      *kind_p = X86_CSE_CONSTM1_VECTOR;
> > > +      *insn_p = nullptr;
> > > +      return constm1_rtx;
> > > +    }
> > > +
> > > +  mode = GET_MODE (op);
> > > +  int nunits = GET_MODE_NUNITS (mode);
> > > +  if (nunits < 2)
> > > +    return nullptr;
> > > +
> > > +  *kind_p = X86_CSE_VEC_DUP;
> > > +
> > > +  rtx reg;
> > > +  if (GET_CODE (op) == VEC_DUPLICATE)
> > > +    {
> > > +      /* Only
> > > +         (vec_duplicate:V4SI (reg:SI 99))
> > > +         (vec_duplicate:V2DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC1") 
> > > [flags 0x2]) [0  S8 A64]))
> > > +        are supported.  */
> > > +      op = XEXP (op, 0);
> > > +      reg = op;
> > > +      if (SUBREG_P (op)
> > > +         && SUBREG_BYTE (op) == 0
> > > +         && !paradoxical_subreg_p (op))
> > > +       reg = SUBREG_REG (op);
> > > +      if (!REG_P (reg))
> > > +       {
> > > +         if (MEM_P (op)
> > > +             && SYMBOL_REF_P (XEXP (op, 0))
> > > +             && CONSTANT_POOL_ADDRESS_P (XEXP (op, 0)))
> > > +           {
> > > +             /* Handle constant broadcast from memory.  */
> > > +             *scalar_mode_p = GET_MODE_INNER (mode);
> > > +             *insn_p = nullptr;
> > > +             return op;
> > > +           }
> > > +         return nullptr;
> > > +       }
> > > +    }
> > > +  else if (CONST_VECTOR_P (op))
> > > +    {
> > > +      rtx first = XVECEXP (op, 0, 0);
> > > +      for (int i = 1; i < nunits; ++i)
> > > +       {
> > > +         rtx tmp = XVECEXP (op, 0, i);
> > > +         /* Vector duplicate value.  */
> > > +         if (!rtx_equal_p (tmp, first))
> > > +           return nullptr;
> > > +       }
> > > +      *scalar_mode_p = GET_MODE (first);
> > > +      *insn_p = nullptr;
> > > +      return first;
> > > +    }
> > > +  else
> > > +    return nullptr;
> > > +
> > > +  mode = GET_MODE (op);
> > > +
> > > +  /* Only single def chain is supported.  */
> > > +  df_ref ref = DF_REG_DEF_CHAIN (REGNO (reg));
> > > +  if (!ref
> > > +      || DF_REF_IS_ARTIFICIAL (ref)
> > > +      || DF_REF_NEXT_REG (ref) != nullptr)
> > > +    return nullptr;
> > > +
> > > +  rtx_insn *insn = DF_REF_INSN (ref);
> > > +  rtx set = single_set (insn);
> > > +  if (!set)
> > > +    return nullptr;
> > > +
> > > +  rtx dest = SET_DEST (set);
> > > +
> > > +  op = SET_SRC (set);
> > > +  /* Set *INSN_P if the scalar source isn't a constant nor an incoming
> > > +     argument.  */
> > > +  if (CONST_INT_P (op) || CONST_DOUBLE_P (op))
> > > +    *insn_p = nullptr;
> > > +  else if (REG_P (op)
> > > +          && REG_EXPR (op)
> > > +          && TREE_CODE (REG_EXPR (op)) == PARM_DECL)
> > > +    *insn_p = nullptr;
> > > +  else if (MEM_P (op)
> > > +          && MEM_EXPR (op)
> > > +          && TREE_CODE (get_base_address (MEM_EXPR (op))) == PARM_DECL)
> > > +    *insn_p = nullptr;
> > > +  else
> > > +    {
> > > +      while (SUBREG_P (dest))
> > > +       dest = SUBREG_REG (dest);
> > > +
> > > +      /* Skip if the SET destination mode doesn't match.  */
> > > +      if (GET_MODE (dest) != mode)
> > > +       return nullptr;
> >
> > Can we just require (dest == reg || dest == op), otherwise we need to
> > make sure GET_MODE of the original dest can cover mode of op(which is
> > more complicated, need to make sure SUBREG_BYTE is also zero???)
>
> I will change it to
>
>       /* Skip if the SET destination isn't the broadcast source.  */
>       if (dest != reg)
>         return nullptr;
>
> > > +
> > > +      /* Set the inner scalar to the SET destination. */
> > > +      op = dest;
> > > +      *insn_p = insn;
> > > +    }
> > > +
> > > +  *scalar_mode_p = mode;
> > > +  if (CONSTANT_P (op))
> > > +    *insn_p = nullptr;
> > > +  else
> > > +    *insn_p = insn;
> > Is this really needed? Do we have any case with vec_duplicate inner
>
> Yes.  For
>
>   s1 = __extension__(v4si){34, 34, 34, 34};
>
> we generate
>
> (insn 6 2 5 2 (set (reg:SI 99)
>         (const_int 34 [0x22])) "3.c":35:6 100 {*movsi_internal}
>      (nil))
> (insn 5 6 7 2 (set (reg:V4SI 98)
>         (vec_duplicate:V4SI (reg:SI 99))) "3.c":35:6 9240 {*vec_dupv4si}
>      (expr_list:REG_EQUAL (const_vector:V4SI [
>                 (const_int 34 [0x22]) repeated x4
>             ])
>         (nil)))
>
> We record it as
>
>  (set (reg:V4SI 98)
>         (vec_duplicate:V4SI (const_int 34 [0x22])))
>
> > constant instead of const_vector?
Oh, I missed op = SET_SRC (set).
Then shouldn't insn_p already be set as NULL by

 +  if (CONST_INT_P (op) || CONST_DOUBLE_P (op))
 +    *insn_p = nullptr;

> >
> > Others LGTM
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> H.J.



-- 
BR,
Hongtao

Reply via email to