On Thu, Jun 19, 2025 at 10:25 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > Extend the remove_redundant_vector pass to handle vector broadcasts from > constant and variable scalars. When broadcasting from constants and > function arguments, we can place a single widest vector broadcast at > entry of the nearest common dominator for basic blocks with all uses > since constants and function arguments aren't changed. For broadcast > from variables with a single definition, the single definition is > replaced with the widest broadcast. > > gcc/ > > PR target/92080 > * config/i386/i386-expand.cc (ix86_expand_call): Set > recursive_function to true for recursive call. > * config/i386/i386-features.cc (ix86_place_single_vector_set): > Add an argument for inner scalar, default to nullptr. Set the > source from inner scalar if not nullptr. > (ix86_get_vector_load_mode): Renamed to ... > (ix86_get_vector_cse_mode): This. Add an argument for scalar mode > and handle integer and float scalar modes. > (replace_vector_const): Add an argument for scalar mode and pass > it to ix86_get_vector_load_mode. > (x86_cse_kind): New. > (redundant_load): Likewise. > (ix86_broadcast_inner): Likewise. > (remove_redundant_vector_load): Also support const0_rtx and > constm1_rtx broadcasts. Handle vector broadcasts from constant > and variable scalars. > * config/i386/i386.h (machine_function): Add recursive_function. > > gcc/testsuite/ > > * gcc.target/i386/keylocker-aesdecwide128kl.c: Updated to expect > movdqa instead pxor. > * gcc.target/i386/keylocker-aesdecwide256kl.c: Likewise. > * gcc.target/i386/keylocker-aesencwide128kl.c: Likewise. > * gcc.target/i386/keylocker-aesencwide256kl.c: Likewise. > * gcc.target/i386/pr92080-4.c: New test. > * gcc.target/i386/pr92080-5.c: Likewise. > * gcc.target/i386/pr92080-6.c: Likewise. > * gcc.target/i386/pr92080-7.c: Likewise. > * gcc.target/i386/pr92080-8.c: Likewise. > * gcc.target/i386/pr92080-9.c: Likewise. > * gcc.target/i386/pr92080-10.c: Likewise. > * gcc.target/i386/pr92080-11.c: Likewise. > * gcc.target/i386/pr92080-12.c: Likewise. > * gcc.target/i386/pr92080-13.c: Likewise. > * gcc.target/i386/pr92080-14.c: Likewise. > * gcc.target/i386/pr92080-15.c: Likewise. > * gcc.target/i386/pr92080-16.c: Likewise. > > Signed-off-by: H.J. Lu <hjl.to...@gmail.com> > --- > gcc/config/i386/i386-expand.cc | 3 + > gcc/config/i386/i386-features.cc | 410 ++++++++++++++---- > gcc/config/i386/i386.h | 3 + > .../i386/keylocker-aesdecwide128kl.c | 14 +- > .../i386/keylocker-aesdecwide256kl.c | 14 +- > .../i386/keylocker-aesencwide128kl.c | 14 +- > .../i386/keylocker-aesencwide256kl.c | 14 +- > gcc/testsuite/gcc.target/i386/pr92080-10.c | 13 + > gcc/testsuite/gcc.target/i386/pr92080-11.c | 33 ++ > gcc/testsuite/gcc.target/i386/pr92080-12.c | 16 + > gcc/testsuite/gcc.target/i386/pr92080-13.c | 32 ++ > gcc/testsuite/gcc.target/i386/pr92080-14.c | 31 ++ > gcc/testsuite/gcc.target/i386/pr92080-15.c | 25 ++ > gcc/testsuite/gcc.target/i386/pr92080-16.c | 26 ++ > gcc/testsuite/gcc.target/i386/pr92080-4.c | 50 +++ > gcc/testsuite/gcc.target/i386/pr92080-5.c | 109 +++++ > gcc/testsuite/gcc.target/i386/pr92080-6.c | 19 + > gcc/testsuite/gcc.target/i386/pr92080-7.c | 20 + > gcc/testsuite/gcc.target/i386/pr92080-8.c | 16 + > gcc/testsuite/gcc.target/i386/pr92080-9.c | 81 ++++ > 20 files changed, 823 insertions(+), 120 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-10.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-11.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-12.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-13.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-14.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-15.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-16.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-4.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-5.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-6.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-7.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-8.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-9.c > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > index 4946f87a131..60c1ac5d2b1 100644 > --- a/gcc/config/i386/i386-expand.cc > +++ b/gcc/config/i386/i386-expand.cc > @@ -10154,6 +10154,9 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx > callarg1, > else if (lookup_attribute ("no_callee_saved_registers", > TYPE_ATTRIBUTES (TREE_TYPE (fndecl)))) > call_no_callee_saved_registers = true; > + if (fndecl == current_function_decl > + && decl_binds_to_current_def_p (fndecl)) > + cfun->machine->recursive_function = true; > } > } > else > diff --git a/gcc/config/i386/i386-features.cc > b/gcc/config/i386/i386-features.cc > index 56ab7f2d23b..2c49d11da8f 100644 > --- a/gcc/config/i386/i386-features.cc > +++ b/gcc/config/i386/i386-features.cc > @@ -3088,10 +3088,12 @@ ix86_rpad_gate () > /* Generate a vector set, DEST = SRC, at entry of the nearest dominator > for basic block map BBS, which is in the fake loop that contains the > whole function, so that there is only a single vector set in the > - whole function. */ > + whole function. If not nullptr, INNER_SCALAR is the inner scalar of > + SRC, as (reg:SI 99) in (vec_duplicate:V4SI (reg:SI 99)). */ > > static void > -ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs) > +ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs, > + rtx inner_scalar = nullptr) > { > basic_block bb = nearest_common_dominator_for_set (CDI_DOMINATORS, bbs); > while (bb->loop_father->latch > @@ -3112,10 +3114,23 @@ ix86_place_single_vector_set (rtx dest, rtx src, > bitmap bbs) > insn = NEXT_INSN (insn); > } > > + rtx_insn *set_insn; > if (insn == BB_HEAD (bb)) > - emit_insn_before (set, insn); > + set_insn = emit_insn_before (set, insn); > else > - emit_insn_after (set, insn ? PREV_INSN (insn) : BB_END (bb)); > + set_insn = emit_insn_after (set, > + insn ? PREV_INSN (insn) : BB_END (bb)); > + > + if (inner_scalar) > + { > + /* Set the source in (vec_duplicate:V4SI (reg:SI 99)). */ > + rtx reg = XEXP (src, 0); > + if ((REG_P (inner_scalar) || MEM_P (inner_scalar)) > + && GET_MODE (reg) != GET_MODE (inner_scalar)) > + inner_scalar = gen_rtx_SUBREG (GET_MODE (reg), inner_scalar, 0); > + rtx set = gen_rtx_SET (reg, inner_scalar); > + emit_insn_before (set, set_insn); > + } > } > > /* At entry of the nearest common dominator for basic blocks with > @@ -3346,26 +3361,15 @@ make_pass_remove_partial_avx_dependency (gcc::context > *ctxt) > return new pass_remove_partial_avx_dependency (ctxt); > } > > -/* Return a machine mode suitable for vector SIZE. */ > +/* Return a machine mode suitable for vector SIZE with SMODE inner > + mode. */ > > static machine_mode > -ix86_get_vector_load_mode (unsigned int size) > +ix86_get_vector_cse_mode (unsigned int size, machine_mode smode) > { > - machine_mode mode; > - if (size == 64) > - mode = V64QImode; > - else if (size == 32) > - mode = V32QImode; > - else if (size == 16) > - mode = V16QImode; > - else if (size == 8) > - mode = V8QImode; > - else if (size == 4) > - mode = V4QImode; > - else if (size == 2) > - mode = V2QImode; > - else > - gcc_unreachable (); > + scalar_mode s_mode = as_a <scalar_mode> (smode); > + poly_uint64 nunits = size / GET_MODE_SIZE (smode); > + machine_mode mode = mode_for_vector (s_mode, nunits).require (); > return mode; > } > > @@ -3374,7 +3378,8 @@ ix86_get_vector_load_mode (unsigned int size) > > static void > replace_vector_const (machine_mode vector_mode, rtx vector_const, > - auto_bitmap &vector_insns) > + auto_bitmap &vector_insns, > + machine_mode scalar_mode) > { > bitmap_iterator bi; > unsigned int id; > @@ -3386,7 +3391,8 @@ replace_vector_const (machine_mode vector_mode, rtx > vector_const, > /* Get the single SET instruction. */ > rtx set = single_set (insn); > rtx src = SET_SRC (set); > - machine_mode mode = GET_MODE (src); > + rtx dest = SET_DEST (set); > + machine_mode mode = GET_MODE (dest); > > rtx replace; > /* Replace the source operand with VECTOR_CONST. */ > @@ -3400,7 +3406,8 @@ replace_vector_const (machine_mode vector_mode, rtx > vector_const, > /* If the mode size is smaller than its natural size, > first insert an extra move with a QI vector SUBREG > of the same size to avoid validate_subreg failure. */ > - machine_mode vmode = ix86_get_vector_load_mode (size); > + machine_mode vmode > + = ix86_get_vector_cse_mode (size, scalar_mode); > rtx vreg; > if (mode == vmode) > vreg = vector_const; > @@ -3426,6 +3433,169 @@ replace_vector_const (machine_mode vector_mode, rtx > vector_const, > } > } > > +enum x86_cse_kind > +{ > + X86_CSE_CONST0_VECTOR, > + X86_CSE_CONSTM1_VECTOR, > + X86_CSE_VEC_DUP > +}; > + > +struct redundant_load > +{ > + /* Bitmap of basic blocks with broadcast instructions. */ > + auto_bitmap bbs; > + /* Bitmap of broadcast instructions. */ > + auto_bitmap insns; > + /* The broadcast inner scalar. */ > + rtx val; > + /* The inner scalar mode. */ > + machine_mode mode; > + /* The instruction which sets the inner scalar. Nullptr if the inner > + scalar is applied to the whole function, instead of within the same > + block. */ > + rtx_insn *def_insn; > + /* The widest broadcast source. */ > + rtx broadcast_source; > + /* The widest broadcast register. */ > + rtx broadcast_reg; > + /* The basic block of the broadcast instruction. */ > + basic_block bb; > + /* The number of broadcast instructions with the same inner scalar. */ > + unsigned HOST_WIDE_INT count; > + /* The threshold of broadcast instructions with the same inner > + scalar. */ > + unsigned int threshold; > + /* The widest broadcast size in bytes. */ > + unsigned int size; > + /* Load kind. */ > + x86_cse_kind kind; > +}; > + > +/* Return the inner scalar if OP is a broadcast, else return nullptr. */ > + > +static rtx > +ix86_broadcast_inner (rtx op, machine_mode mode, > + machine_mode *scalar_mode_p, > + x86_cse_kind *kind_p, rtx_insn **insn_p) > +{ > + if (op == const0_rtx || op == CONST0_RTX (mode)) > + { > + *scalar_mode_p = QImode; > + *kind_p = X86_CSE_CONST0_VECTOR; > + *insn_p = nullptr; > + return const0_rtx; > + } > + else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT > + && (op == constm1_rtx || op == CONSTM1_RTX (mode))) > + { > + *scalar_mode_p = QImode; > + *kind_p = X86_CSE_CONSTM1_VECTOR; > + *insn_p = nullptr; > + return constm1_rtx; > + } > + > + mode = GET_MODE (op); > + int nunits = GET_MODE_NUNITS (mode); > + if (nunits < 2) > + return nullptr; > + > + *kind_p = X86_CSE_VEC_DUP; > + > + rtx reg; > + if (GET_CODE (op) == VEC_DUPLICATE) > + { > + /* Only > + (vec_duplicate:V4SI (reg:SI 99)) > + (vec_duplicate:V2DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC1") [flags > 0x2]) [0 S8 A64])) > + are supported. */ > + op = XEXP (op, 0); > + reg = op; > + if (SUBREG_P (op) > + && SUBREG_BYTE (op) == 0 > + && !paradoxical_subreg_p (op)) > + reg = SUBREG_REG (op); > + if (!REG_P (reg)) > + { > + if (MEM_P (op) > + && SYMBOL_REF_P (XEXP (op, 0)) > + && CONSTANT_POOL_ADDRESS_P (XEXP (op, 0))) > + { > + /* Handle constant broadcast from memory. */ > + *scalar_mode_p = GET_MODE_INNER (mode); > + *insn_p = nullptr; > + return op; > + } > + return nullptr; > + } > + } > + else if (CONST_VECTOR_P (op)) > + { > + rtx first = XVECEXP (op, 0, 0); > + for (int i = 1; i < nunits; ++i) > + { > + rtx tmp = XVECEXP (op, 0, i); > + /* Vector duplicate value. */ > + if (!rtx_equal_p (tmp, first)) > + return nullptr; > + } > + *scalar_mode_p = GET_MODE (first); > + *insn_p = nullptr; > + return first; > + } > + else > + return nullptr; > + > + mode = GET_MODE (op); > + > + /* Only single def chain is supported. */ > + df_ref ref = DF_REG_DEF_CHAIN (REGNO (reg)); > + if (!ref > + || DF_REF_IS_ARTIFICIAL (ref) > + || DF_REF_NEXT_REG (ref) != nullptr) > + return nullptr; > + > + rtx_insn *insn = DF_REF_INSN (ref); > + rtx set = single_set (insn); > + if (!set) > + return nullptr; > + > + rtx dest = SET_DEST (set); > + > + op = SET_SRC (set); > + /* Set *INSN_P if the scalar source isn't a constant nor an incoming > + argument. */ > + if (CONST_INT_P (op) || CONST_DOUBLE_P (op)) > + *insn_p = nullptr; > + else if (REG_P (op) > + && REG_EXPR (op) > + && TREE_CODE (REG_EXPR (op)) == PARM_DECL) > + *insn_p = nullptr; > + else if (MEM_P (op) > + && MEM_EXPR (op) > + && TREE_CODE (get_base_address (MEM_EXPR (op))) == PARM_DECL) > + *insn_p = nullptr; > + else > + { > + while (SUBREG_P (dest)) > + dest = SUBREG_REG (dest); > + > + /* Skip if the SET destination mode doesn't match. */ > + if (GET_MODE (dest) != mode) > + return nullptr;
Can we just require (dest == reg || dest == op), otherwise we need to make sure GET_MODE of the original dest can cover mode of op(which is more complicated, need to make sure SUBREG_BYTE is also zero???) > + > + /* Set the inner scalar to the SET destination. */ > + op = dest; > + *insn_p = insn; > + } > + > + *scalar_mode_p = mode; > + if (CONSTANT_P (op)) > + *insn_p = nullptr; > + else > + *insn_p = insn; Is this really needed? Do we have any case with vec_duplicate inner constant instead of const_vector? Others LGTM -- BR, Hongtao