On Mon, Jun 23, 2025 at 3:11 PM Hongtao Liu <crazy...@gmail.com> wrote: > > On Thu, Jun 19, 2025 at 10:25 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > Extend the remove_redundant_vector pass to handle vector broadcasts from > > constant and variable scalars. When broadcasting from constants and > > function arguments, we can place a single widest vector broadcast at > > entry of the nearest common dominator for basic blocks with all uses > > since constants and function arguments aren't changed. For broadcast > > from variables with a single definition, the single definition is > > replaced with the widest broadcast. > > > > gcc/ > > > > PR target/92080 > > * config/i386/i386-expand.cc (ix86_expand_call): Set > > recursive_function to true for recursive call. > > * config/i386/i386-features.cc (ix86_place_single_vector_set): > > Add an argument for inner scalar, default to nullptr. Set the > > source from inner scalar if not nullptr. > > (ix86_get_vector_load_mode): Renamed to ... > > (ix86_get_vector_cse_mode): This. Add an argument for scalar mode > > and handle integer and float scalar modes. > > (replace_vector_const): Add an argument for scalar mode and pass > > it to ix86_get_vector_load_mode. > > (x86_cse_kind): New. > > (redundant_load): Likewise. > > (ix86_broadcast_inner): Likewise. > > (remove_redundant_vector_load): Also support const0_rtx and > > constm1_rtx broadcasts. Handle vector broadcasts from constant > > and variable scalars. > > * config/i386/i386.h (machine_function): Add recursive_function. > > > > gcc/testsuite/ > > > > * gcc.target/i386/keylocker-aesdecwide128kl.c: Updated to expect > > movdqa instead pxor. > > * gcc.target/i386/keylocker-aesdecwide256kl.c: Likewise. > > * gcc.target/i386/keylocker-aesencwide128kl.c: Likewise. > > * gcc.target/i386/keylocker-aesencwide256kl.c: Likewise. > > * gcc.target/i386/pr92080-4.c: New test. > > * gcc.target/i386/pr92080-5.c: Likewise. > > * gcc.target/i386/pr92080-6.c: Likewise. > > * gcc.target/i386/pr92080-7.c: Likewise. > > * gcc.target/i386/pr92080-8.c: Likewise. > > * gcc.target/i386/pr92080-9.c: Likewise. > > * gcc.target/i386/pr92080-10.c: Likewise. > > * gcc.target/i386/pr92080-11.c: Likewise. > > * gcc.target/i386/pr92080-12.c: Likewise. > > * gcc.target/i386/pr92080-13.c: Likewise. > > * gcc.target/i386/pr92080-14.c: Likewise. > > * gcc.target/i386/pr92080-15.c: Likewise. > > * gcc.target/i386/pr92080-16.c: Likewise. > > > > Signed-off-by: H.J. Lu <hjl.to...@gmail.com> > > --- > > gcc/config/i386/i386-expand.cc | 3 + > > gcc/config/i386/i386-features.cc | 410 ++++++++++++++---- > > gcc/config/i386/i386.h | 3 + > > .../i386/keylocker-aesdecwide128kl.c | 14 +- > > .../i386/keylocker-aesdecwide256kl.c | 14 +- > > .../i386/keylocker-aesencwide128kl.c | 14 +- > > .../i386/keylocker-aesencwide256kl.c | 14 +- > > gcc/testsuite/gcc.target/i386/pr92080-10.c | 13 + > > gcc/testsuite/gcc.target/i386/pr92080-11.c | 33 ++ > > gcc/testsuite/gcc.target/i386/pr92080-12.c | 16 + > > gcc/testsuite/gcc.target/i386/pr92080-13.c | 32 ++ > > gcc/testsuite/gcc.target/i386/pr92080-14.c | 31 ++ > > gcc/testsuite/gcc.target/i386/pr92080-15.c | 25 ++ > > gcc/testsuite/gcc.target/i386/pr92080-16.c | 26 ++ > > gcc/testsuite/gcc.target/i386/pr92080-4.c | 50 +++ > > gcc/testsuite/gcc.target/i386/pr92080-5.c | 109 +++++ > > gcc/testsuite/gcc.target/i386/pr92080-6.c | 19 + > > gcc/testsuite/gcc.target/i386/pr92080-7.c | 20 + > > gcc/testsuite/gcc.target/i386/pr92080-8.c | 16 + > > gcc/testsuite/gcc.target/i386/pr92080-9.c | 81 ++++ > > 20 files changed, 823 insertions(+), 120 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-10.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-11.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-12.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-13.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-14.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-15.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-16.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-4.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-5.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-6.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-7.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-8.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-9.c > > > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > > index 4946f87a131..60c1ac5d2b1 100644 > > --- a/gcc/config/i386/i386-expand.cc > > +++ b/gcc/config/i386/i386-expand.cc > > @@ -10154,6 +10154,9 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx > > callarg1, > > else if (lookup_attribute ("no_callee_saved_registers", > > TYPE_ATTRIBUTES (TREE_TYPE (fndecl)))) > > call_no_callee_saved_registers = true; > > + if (fndecl == current_function_decl > > + && decl_binds_to_current_def_p (fndecl)) > > + cfun->machine->recursive_function = true; > > } > > } > > else > > diff --git a/gcc/config/i386/i386-features.cc > > b/gcc/config/i386/i386-features.cc > > index 56ab7f2d23b..2c49d11da8f 100644 > > --- a/gcc/config/i386/i386-features.cc > > +++ b/gcc/config/i386/i386-features.cc > > @@ -3088,10 +3088,12 @@ ix86_rpad_gate () > > /* Generate a vector set, DEST = SRC, at entry of the nearest dominator > > for basic block map BBS, which is in the fake loop that contains the > > whole function, so that there is only a single vector set in the > > - whole function. */ > > + whole function. If not nullptr, INNER_SCALAR is the inner scalar of > > + SRC, as (reg:SI 99) in (vec_duplicate:V4SI (reg:SI 99)). */ > > > > static void > > -ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs) > > +ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs, > > + rtx inner_scalar = nullptr) > > { > > basic_block bb = nearest_common_dominator_for_set (CDI_DOMINATORS, bbs); > > while (bb->loop_father->latch > > @@ -3112,10 +3114,23 @@ ix86_place_single_vector_set (rtx dest, rtx src, > > bitmap bbs) > > insn = NEXT_INSN (insn); > > } > > > > + rtx_insn *set_insn; > > if (insn == BB_HEAD (bb)) > > - emit_insn_before (set, insn); > > + set_insn = emit_insn_before (set, insn); > > else > > - emit_insn_after (set, insn ? PREV_INSN (insn) : BB_END (bb)); > > + set_insn = emit_insn_after (set, > > + insn ? PREV_INSN (insn) : BB_END (bb)); > > + > > + if (inner_scalar) > > + { > > + /* Set the source in (vec_duplicate:V4SI (reg:SI 99)). */ > > + rtx reg = XEXP (src, 0); > > + if ((REG_P (inner_scalar) || MEM_P (inner_scalar)) > > + && GET_MODE (reg) != GET_MODE (inner_scalar)) > > + inner_scalar = gen_rtx_SUBREG (GET_MODE (reg), inner_scalar, 0); > > + rtx set = gen_rtx_SET (reg, inner_scalar); > > + emit_insn_before (set, set_insn); > > + } > > } > > > > /* At entry of the nearest common dominator for basic blocks with > > @@ -3346,26 +3361,15 @@ make_pass_remove_partial_avx_dependency > > (gcc::context *ctxt) > > return new pass_remove_partial_avx_dependency (ctxt); > > } > > > > -/* Return a machine mode suitable for vector SIZE. */ > > +/* Return a machine mode suitable for vector SIZE with SMODE inner > > + mode. */ > > > > static machine_mode > > -ix86_get_vector_load_mode (unsigned int size) > > +ix86_get_vector_cse_mode (unsigned int size, machine_mode smode) > > { > > - machine_mode mode; > > - if (size == 64) > > - mode = V64QImode; > > - else if (size == 32) > > - mode = V32QImode; > > - else if (size == 16) > > - mode = V16QImode; > > - else if (size == 8) > > - mode = V8QImode; > > - else if (size == 4) > > - mode = V4QImode; > > - else if (size == 2) > > - mode = V2QImode; > > - else > > - gcc_unreachable (); > > + scalar_mode s_mode = as_a <scalar_mode> (smode); > > + poly_uint64 nunits = size / GET_MODE_SIZE (smode); > > + machine_mode mode = mode_for_vector (s_mode, nunits).require (); > > return mode; > > } > > > > @@ -3374,7 +3378,8 @@ ix86_get_vector_load_mode (unsigned int size) > > > > static void > > replace_vector_const (machine_mode vector_mode, rtx vector_const, > > - auto_bitmap &vector_insns) > > + auto_bitmap &vector_insns, > > + machine_mode scalar_mode) > > { > > bitmap_iterator bi; > > unsigned int id; > > @@ -3386,7 +3391,8 @@ replace_vector_const (machine_mode vector_mode, rtx > > vector_const, > > /* Get the single SET instruction. */ > > rtx set = single_set (insn); > > rtx src = SET_SRC (set); > > - machine_mode mode = GET_MODE (src); > > + rtx dest = SET_DEST (set); > > + machine_mode mode = GET_MODE (dest); > > > > rtx replace; > > /* Replace the source operand with VECTOR_CONST. */ > > @@ -3400,7 +3406,8 @@ replace_vector_const (machine_mode vector_mode, rtx > > vector_const, > > /* If the mode size is smaller than its natural size, > > first insert an extra move with a QI vector SUBREG > > of the same size to avoid validate_subreg failure. */ > > - machine_mode vmode = ix86_get_vector_load_mode (size); > > + machine_mode vmode > > + = ix86_get_vector_cse_mode (size, scalar_mode); > > rtx vreg; > > if (mode == vmode) > > vreg = vector_const; > > @@ -3426,6 +3433,169 @@ replace_vector_const (machine_mode vector_mode, rtx > > vector_const, > > } > > } > > > > +enum x86_cse_kind > > +{ > > + X86_CSE_CONST0_VECTOR, > > + X86_CSE_CONSTM1_VECTOR, > > + X86_CSE_VEC_DUP > > +}; > > + > > +struct redundant_load > > +{ > > + /* Bitmap of basic blocks with broadcast instructions. */ > > + auto_bitmap bbs; > > + /* Bitmap of broadcast instructions. */ > > + auto_bitmap insns; > > + /* The broadcast inner scalar. */ > > + rtx val; > > + /* The inner scalar mode. */ > > + machine_mode mode; > > + /* The instruction which sets the inner scalar. Nullptr if the inner > > + scalar is applied to the whole function, instead of within the same > > + block. */ > > + rtx_insn *def_insn; > > + /* The widest broadcast source. */ > > + rtx broadcast_source; > > + /* The widest broadcast register. */ > > + rtx broadcast_reg; > > + /* The basic block of the broadcast instruction. */ > > + basic_block bb; > > + /* The number of broadcast instructions with the same inner scalar. */ > > + unsigned HOST_WIDE_INT count; > > + /* The threshold of broadcast instructions with the same inner > > + scalar. */ > > + unsigned int threshold; > > + /* The widest broadcast size in bytes. */ > > + unsigned int size; > > + /* Load kind. */ > > + x86_cse_kind kind; > > +}; > > + > > +/* Return the inner scalar if OP is a broadcast, else return nullptr. */ > > + > > +static rtx > > +ix86_broadcast_inner (rtx op, machine_mode mode, > > + machine_mode *scalar_mode_p, > > + x86_cse_kind *kind_p, rtx_insn **insn_p) > > +{ > > + if (op == const0_rtx || op == CONST0_RTX (mode)) > > + { > > + *scalar_mode_p = QImode; > > + *kind_p = X86_CSE_CONST0_VECTOR; > > + *insn_p = nullptr; > > + return const0_rtx; > > + } > > + else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > + && (op == constm1_rtx || op == CONSTM1_RTX (mode))) > > + { > > + *scalar_mode_p = QImode; > > + *kind_p = X86_CSE_CONSTM1_VECTOR; > > + *insn_p = nullptr; > > + return constm1_rtx; > > + } > > + > > + mode = GET_MODE (op); > > + int nunits = GET_MODE_NUNITS (mode); > > + if (nunits < 2) > > + return nullptr; > > + > > + *kind_p = X86_CSE_VEC_DUP; > > + > > + rtx reg; > > + if (GET_CODE (op) == VEC_DUPLICATE) > > + { > > + /* Only > > + (vec_duplicate:V4SI (reg:SI 99)) > > + (vec_duplicate:V2DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC1") [flags > > 0x2]) [0 S8 A64])) > > + are supported. */ > > + op = XEXP (op, 0); > > + reg = op; > > + if (SUBREG_P (op) > > + && SUBREG_BYTE (op) == 0 > > + && !paradoxical_subreg_p (op)) > > + reg = SUBREG_REG (op); > > + if (!REG_P (reg)) > > + { > > + if (MEM_P (op) > > + && SYMBOL_REF_P (XEXP (op, 0)) > > + && CONSTANT_POOL_ADDRESS_P (XEXP (op, 0))) > > + { > > + /* Handle constant broadcast from memory. */ > > + *scalar_mode_p = GET_MODE_INNER (mode); > > + *insn_p = nullptr; > > + return op; > > + } > > + return nullptr; > > + } > > + } > > + else if (CONST_VECTOR_P (op)) > > + { > > + rtx first = XVECEXP (op, 0, 0); > > + for (int i = 1; i < nunits; ++i) > > + { > > + rtx tmp = XVECEXP (op, 0, i); > > + /* Vector duplicate value. */ > > + if (!rtx_equal_p (tmp, first)) > > + return nullptr; > > + } > > + *scalar_mode_p = GET_MODE (first); > > + *insn_p = nullptr; > > + return first; > > + } > > + else > > + return nullptr; > > + > > + mode = GET_MODE (op); > > + > > + /* Only single def chain is supported. */ > > + df_ref ref = DF_REG_DEF_CHAIN (REGNO (reg)); > > + if (!ref > > + || DF_REF_IS_ARTIFICIAL (ref) > > + || DF_REF_NEXT_REG (ref) != nullptr) > > + return nullptr; > > + > > + rtx_insn *insn = DF_REF_INSN (ref); > > + rtx set = single_set (insn); > > + if (!set) > > + return nullptr; > > + > > + rtx dest = SET_DEST (set); > > + > > + op = SET_SRC (set); > > + /* Set *INSN_P if the scalar source isn't a constant nor an incoming > > + argument. */ > > + if (CONST_INT_P (op) || CONST_DOUBLE_P (op)) > > + *insn_p = nullptr; > > + else if (REG_P (op) > > + && REG_EXPR (op) > > + && TREE_CODE (REG_EXPR (op)) == PARM_DECL) > > + *insn_p = nullptr; > > + else if (MEM_P (op) > > + && MEM_EXPR (op) > > + && TREE_CODE (get_base_address (MEM_EXPR (op))) == PARM_DECL) > > + *insn_p = nullptr; > > + else > > + { > > + while (SUBREG_P (dest)) > > + dest = SUBREG_REG (dest); > > + > > + /* Skip if the SET destination mode doesn't match. */ > > + if (GET_MODE (dest) != mode) > > + return nullptr; > > Can we just require (dest == reg || dest == op), otherwise we need to > make sure GET_MODE of the original dest can cover mode of op(which is > more complicated, need to make sure SUBREG_BYTE is also zero???)
I will change it to /* Skip if the SET destination isn't the broadcast source. */ if (dest != reg) return nullptr; > > + > > + /* Set the inner scalar to the SET destination. */ > > + op = dest; > > + *insn_p = insn; > > + } > > + > > + *scalar_mode_p = mode; > > + if (CONSTANT_P (op)) > > + *insn_p = nullptr; > > + else > > + *insn_p = insn; > Is this really needed? Do we have any case with vec_duplicate inner Yes. For s1 = __extension__(v4si){34, 34, 34, 34}; we generate (insn 6 2 5 2 (set (reg:SI 99) (const_int 34 [0x22])) "3.c":35:6 100 {*movsi_internal} (nil)) (insn 5 6 7 2 (set (reg:V4SI 98) (vec_duplicate:V4SI (reg:SI 99))) "3.c":35:6 9240 {*vec_dupv4si} (expr_list:REG_EQUAL (const_vector:V4SI [ (const_int 34 [0x22]) repeated x4 ]) (nil))) We record it as (set (reg:V4SI 98) (vec_duplicate:V4SI (const_int 34 [0x22]))) > constant instead of const_vector? > > Others LGTM > > > > -- > BR, > Hongtao -- H.J.