On Mon, Jun 23, 2025 at 4:10 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > On Mon, Jun 23, 2025 at 3:11 PM Hongtao Liu <crazy...@gmail.com> wrote: > > > > On Thu, Jun 19, 2025 at 10:25 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > Extend the remove_redundant_vector pass to handle vector broadcasts from > > > constant and variable scalars. When broadcasting from constants and > > > function arguments, we can place a single widest vector broadcast at > > > entry of the nearest common dominator for basic blocks with all uses > > > since constants and function arguments aren't changed. For broadcast > > > from variables with a single definition, the single definition is > > > replaced with the widest broadcast. > > > > > > gcc/ > > > > > > PR target/92080 > > > * config/i386/i386-expand.cc (ix86_expand_call): Set > > > recursive_function to true for recursive call. > > > * config/i386/i386-features.cc (ix86_place_single_vector_set): > > > Add an argument for inner scalar, default to nullptr. Set the > > > source from inner scalar if not nullptr. > > > (ix86_get_vector_load_mode): Renamed to ... > > > (ix86_get_vector_cse_mode): This. Add an argument for scalar mode > > > and handle integer and float scalar modes. > > > (replace_vector_const): Add an argument for scalar mode and pass > > > it to ix86_get_vector_load_mode. > > > (x86_cse_kind): New. > > > (redundant_load): Likewise. > > > (ix86_broadcast_inner): Likewise. > > > (remove_redundant_vector_load): Also support const0_rtx and > > > constm1_rtx broadcasts. Handle vector broadcasts from constant > > > and variable scalars. > > > * config/i386/i386.h (machine_function): Add recursive_function. > > > > > > gcc/testsuite/ > > > > > > * gcc.target/i386/keylocker-aesdecwide128kl.c: Updated to expect > > > movdqa instead pxor. > > > * gcc.target/i386/keylocker-aesdecwide256kl.c: Likewise. > > > * gcc.target/i386/keylocker-aesencwide128kl.c: Likewise. > > > * gcc.target/i386/keylocker-aesencwide256kl.c: Likewise. > > > * gcc.target/i386/pr92080-4.c: New test. > > > * gcc.target/i386/pr92080-5.c: Likewise. > > > * gcc.target/i386/pr92080-6.c: Likewise. > > > * gcc.target/i386/pr92080-7.c: Likewise. > > > * gcc.target/i386/pr92080-8.c: Likewise. > > > * gcc.target/i386/pr92080-9.c: Likewise. > > > * gcc.target/i386/pr92080-10.c: Likewise. > > > * gcc.target/i386/pr92080-11.c: Likewise. > > > * gcc.target/i386/pr92080-12.c: Likewise. > > > * gcc.target/i386/pr92080-13.c: Likewise. > > > * gcc.target/i386/pr92080-14.c: Likewise. > > > * gcc.target/i386/pr92080-15.c: Likewise. > > > * gcc.target/i386/pr92080-16.c: Likewise. > > > > > > Signed-off-by: H.J. Lu <hjl.to...@gmail.com> > > > --- > > > gcc/config/i386/i386-expand.cc | 3 + > > > gcc/config/i386/i386-features.cc | 410 ++++++++++++++---- > > > gcc/config/i386/i386.h | 3 + > > > .../i386/keylocker-aesdecwide128kl.c | 14 +- > > > .../i386/keylocker-aesdecwide256kl.c | 14 +- > > > .../i386/keylocker-aesencwide128kl.c | 14 +- > > > .../i386/keylocker-aesencwide256kl.c | 14 +- > > > gcc/testsuite/gcc.target/i386/pr92080-10.c | 13 + > > > gcc/testsuite/gcc.target/i386/pr92080-11.c | 33 ++ > > > gcc/testsuite/gcc.target/i386/pr92080-12.c | 16 + > > > gcc/testsuite/gcc.target/i386/pr92080-13.c | 32 ++ > > > gcc/testsuite/gcc.target/i386/pr92080-14.c | 31 ++ > > > gcc/testsuite/gcc.target/i386/pr92080-15.c | 25 ++ > > > gcc/testsuite/gcc.target/i386/pr92080-16.c | 26 ++ > > > gcc/testsuite/gcc.target/i386/pr92080-4.c | 50 +++ > > > gcc/testsuite/gcc.target/i386/pr92080-5.c | 109 +++++ > > > gcc/testsuite/gcc.target/i386/pr92080-6.c | 19 + > > > gcc/testsuite/gcc.target/i386/pr92080-7.c | 20 + > > > gcc/testsuite/gcc.target/i386/pr92080-8.c | 16 + > > > gcc/testsuite/gcc.target/i386/pr92080-9.c | 81 ++++ > > > 20 files changed, 823 insertions(+), 120 deletions(-) > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-10.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-11.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-12.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-13.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-14.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-15.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-16.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-4.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-5.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-6.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-7.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-8.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-9.c > > > > > > diff --git a/gcc/config/i386/i386-expand.cc > > > b/gcc/config/i386/i386-expand.cc > > > index 4946f87a131..60c1ac5d2b1 100644 > > > --- a/gcc/config/i386/i386-expand.cc > > > +++ b/gcc/config/i386/i386-expand.cc > > > @@ -10154,6 +10154,9 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx > > > callarg1, > > > else if (lookup_attribute ("no_callee_saved_registers", > > > TYPE_ATTRIBUTES (TREE_TYPE > > > (fndecl)))) > > > call_no_callee_saved_registers = true; > > > + if (fndecl == current_function_decl > > > + && decl_binds_to_current_def_p (fndecl)) > > > + cfun->machine->recursive_function = true; > > > } > > > } > > > else > > > diff --git a/gcc/config/i386/i386-features.cc > > > b/gcc/config/i386/i386-features.cc > > > index 56ab7f2d23b..2c49d11da8f 100644 > > > --- a/gcc/config/i386/i386-features.cc > > > +++ b/gcc/config/i386/i386-features.cc > > > @@ -3088,10 +3088,12 @@ ix86_rpad_gate () > > > /* Generate a vector set, DEST = SRC, at entry of the nearest dominator > > > for basic block map BBS, which is in the fake loop that contains the > > > whole function, so that there is only a single vector set in the > > > - whole function. */ > > > + whole function. If not nullptr, INNER_SCALAR is the inner scalar of > > > + SRC, as (reg:SI 99) in (vec_duplicate:V4SI (reg:SI 99)). */ > > > > > > static void > > > -ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs) > > > +ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs, > > > + rtx inner_scalar = nullptr) > > > { > > > basic_block bb = nearest_common_dominator_for_set (CDI_DOMINATORS, > > > bbs); > > > while (bb->loop_father->latch > > > @@ -3112,10 +3114,23 @@ ix86_place_single_vector_set (rtx dest, rtx src, > > > bitmap bbs) > > > insn = NEXT_INSN (insn); > > > } > > > > > > + rtx_insn *set_insn; > > > if (insn == BB_HEAD (bb)) > > > - emit_insn_before (set, insn); > > > + set_insn = emit_insn_before (set, insn); > > > else > > > - emit_insn_after (set, insn ? PREV_INSN (insn) : BB_END (bb)); > > > + set_insn = emit_insn_after (set, > > > + insn ? PREV_INSN (insn) : BB_END (bb)); > > > + > > > + if (inner_scalar) > > > + { > > > + /* Set the source in (vec_duplicate:V4SI (reg:SI 99)). */ > > > + rtx reg = XEXP (src, 0); > > > + if ((REG_P (inner_scalar) || MEM_P (inner_scalar)) > > > + && GET_MODE (reg) != GET_MODE (inner_scalar)) > > > + inner_scalar = gen_rtx_SUBREG (GET_MODE (reg), inner_scalar, 0); > > > + rtx set = gen_rtx_SET (reg, inner_scalar); > > > + emit_insn_before (set, set_insn); > > > + } > > > } > > > > > > /* At entry of the nearest common dominator for basic blocks with > > > @@ -3346,26 +3361,15 @@ make_pass_remove_partial_avx_dependency > > > (gcc::context *ctxt) > > > return new pass_remove_partial_avx_dependency (ctxt); > > > } > > > > > > -/* Return a machine mode suitable for vector SIZE. */ > > > +/* Return a machine mode suitable for vector SIZE with SMODE inner > > > + mode. */ > > > > > > static machine_mode > > > -ix86_get_vector_load_mode (unsigned int size) > > > +ix86_get_vector_cse_mode (unsigned int size, machine_mode smode) > > > { > > > - machine_mode mode; > > > - if (size == 64) > > > - mode = V64QImode; > > > - else if (size == 32) > > > - mode = V32QImode; > > > - else if (size == 16) > > > - mode = V16QImode; > > > - else if (size == 8) > > > - mode = V8QImode; > > > - else if (size == 4) > > > - mode = V4QImode; > > > - else if (size == 2) > > > - mode = V2QImode; > > > - else > > > - gcc_unreachable (); > > > + scalar_mode s_mode = as_a <scalar_mode> (smode); > > > + poly_uint64 nunits = size / GET_MODE_SIZE (smode); > > > + machine_mode mode = mode_for_vector (s_mode, nunits).require (); > > > return mode; > > > } > > > > > > @@ -3374,7 +3378,8 @@ ix86_get_vector_load_mode (unsigned int size) > > > > > > static void > > > replace_vector_const (machine_mode vector_mode, rtx vector_const, > > > - auto_bitmap &vector_insns) > > > + auto_bitmap &vector_insns, > > > + machine_mode scalar_mode) > > > { > > > bitmap_iterator bi; > > > unsigned int id; > > > @@ -3386,7 +3391,8 @@ replace_vector_const (machine_mode vector_mode, rtx > > > vector_const, > > > /* Get the single SET instruction. */ > > > rtx set = single_set (insn); > > > rtx src = SET_SRC (set); > > > - machine_mode mode = GET_MODE (src); > > > + rtx dest = SET_DEST (set); > > > + machine_mode mode = GET_MODE (dest); > > > > > > rtx replace; > > > /* Replace the source operand with VECTOR_CONST. */ > > > @@ -3400,7 +3406,8 @@ replace_vector_const (machine_mode vector_mode, rtx > > > vector_const, > > > /* If the mode size is smaller than its natural size, > > > first insert an extra move with a QI vector SUBREG > > > of the same size to avoid validate_subreg failure. */ > > > - machine_mode vmode = ix86_get_vector_load_mode (size); > > > + machine_mode vmode > > > + = ix86_get_vector_cse_mode (size, scalar_mode); > > > rtx vreg; > > > if (mode == vmode) > > > vreg = vector_const; > > > @@ -3426,6 +3433,169 @@ replace_vector_const (machine_mode vector_mode, > > > rtx vector_const, > > > } > > > } > > > > > > +enum x86_cse_kind > > > +{ > > > + X86_CSE_CONST0_VECTOR, > > > + X86_CSE_CONSTM1_VECTOR, > > > + X86_CSE_VEC_DUP > > > +}; > > > + > > > +struct redundant_load > > > +{ > > > + /* Bitmap of basic blocks with broadcast instructions. */ > > > + auto_bitmap bbs; > > > + /* Bitmap of broadcast instructions. */ > > > + auto_bitmap insns; > > > + /* The broadcast inner scalar. */ > > > + rtx val; > > > + /* The inner scalar mode. */ > > > + machine_mode mode; > > > + /* The instruction which sets the inner scalar. Nullptr if the inner > > > + scalar is applied to the whole function, instead of within the same > > > + block. */ > > > + rtx_insn *def_insn; > > > + /* The widest broadcast source. */ > > > + rtx broadcast_source; > > > + /* The widest broadcast register. */ > > > + rtx broadcast_reg; > > > + /* The basic block of the broadcast instruction. */ > > > + basic_block bb; > > > + /* The number of broadcast instructions with the same inner scalar. */ > > > + unsigned HOST_WIDE_INT count; > > > + /* The threshold of broadcast instructions with the same inner > > > + scalar. */ > > > + unsigned int threshold; > > > + /* The widest broadcast size in bytes. */ > > > + unsigned int size; > > > + /* Load kind. */ > > > + x86_cse_kind kind; > > > +}; > > > + > > > +/* Return the inner scalar if OP is a broadcast, else return nullptr. */ > > > + > > > +static rtx > > > +ix86_broadcast_inner (rtx op, machine_mode mode, > > > + machine_mode *scalar_mode_p, > > > + x86_cse_kind *kind_p, rtx_insn **insn_p) > > > +{ > > > + if (op == const0_rtx || op == CONST0_RTX (mode)) > > > + { > > > + *scalar_mode_p = QImode; > > > + *kind_p = X86_CSE_CONST0_VECTOR; > > > + *insn_p = nullptr; > > > + return const0_rtx; > > > + } > > > + else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > > + && (op == constm1_rtx || op == CONSTM1_RTX (mode))) > > > + { > > > + *scalar_mode_p = QImode; > > > + *kind_p = X86_CSE_CONSTM1_VECTOR; > > > + *insn_p = nullptr; > > > + return constm1_rtx; > > > + } > > > + > > > + mode = GET_MODE (op); > > > + int nunits = GET_MODE_NUNITS (mode); > > > + if (nunits < 2) > > > + return nullptr; > > > + > > > + *kind_p = X86_CSE_VEC_DUP; > > > + > > > + rtx reg; > > > + if (GET_CODE (op) == VEC_DUPLICATE) > > > + { > > > + /* Only > > > + (vec_duplicate:V4SI (reg:SI 99)) > > > + (vec_duplicate:V2DF (mem/u/c:DF (symbol_ref/u:DI ("*.LC1") > > > [flags 0x2]) [0 S8 A64])) > > > + are supported. */ > > > + op = XEXP (op, 0); > > > + reg = op; > > > + if (SUBREG_P (op) > > > + && SUBREG_BYTE (op) == 0 > > > + && !paradoxical_subreg_p (op)) > > > + reg = SUBREG_REG (op); > > > + if (!REG_P (reg)) > > > + { > > > + if (MEM_P (op) > > > + && SYMBOL_REF_P (XEXP (op, 0)) > > > + && CONSTANT_POOL_ADDRESS_P (XEXP (op, 0))) > > > + { > > > + /* Handle constant broadcast from memory. */ > > > + *scalar_mode_p = GET_MODE_INNER (mode); > > > + *insn_p = nullptr; > > > + return op; > > > + } > > > + return nullptr; > > > + } > > > + } > > > + else if (CONST_VECTOR_P (op)) > > > + { > > > + rtx first = XVECEXP (op, 0, 0); > > > + for (int i = 1; i < nunits; ++i) > > > + { > > > + rtx tmp = XVECEXP (op, 0, i); > > > + /* Vector duplicate value. */ > > > + if (!rtx_equal_p (tmp, first)) > > > + return nullptr; > > > + } > > > + *scalar_mode_p = GET_MODE (first); > > > + *insn_p = nullptr; > > > + return first; > > > + } > > > + else > > > + return nullptr; > > > + > > > + mode = GET_MODE (op); > > > + > > > + /* Only single def chain is supported. */ > > > + df_ref ref = DF_REG_DEF_CHAIN (REGNO (reg)); > > > + if (!ref > > > + || DF_REF_IS_ARTIFICIAL (ref) > > > + || DF_REF_NEXT_REG (ref) != nullptr) > > > + return nullptr; > > > + > > > + rtx_insn *insn = DF_REF_INSN (ref); > > > + rtx set = single_set (insn); > > > + if (!set) > > > + return nullptr; > > > + > > > + rtx dest = SET_DEST (set); > > > + > > > + op = SET_SRC (set); > > > + /* Set *INSN_P if the scalar source isn't a constant nor an incoming > > > + argument. */ > > > + if (CONST_INT_P (op) || CONST_DOUBLE_P (op)) > > > + *insn_p = nullptr; > > > + else if (REG_P (op) > > > + && REG_EXPR (op) > > > + && TREE_CODE (REG_EXPR (op)) == PARM_DECL) > > > + *insn_p = nullptr; > > > + else if (MEM_P (op) > > > + && MEM_EXPR (op) > > > + && TREE_CODE (get_base_address (MEM_EXPR (op))) == PARM_DECL) > > > + *insn_p = nullptr; > > > + else > > > + { > > > + while (SUBREG_P (dest)) > > > + dest = SUBREG_REG (dest); > > > + > > > + /* Skip if the SET destination mode doesn't match. */ > > > + if (GET_MODE (dest) != mode) > > > + return nullptr; > > > > Can we just require (dest == reg || dest == op), otherwise we need to > > make sure GET_MODE of the original dest can cover mode of op(which is > > more complicated, need to make sure SUBREG_BYTE is also zero???) > > I will change it to > > /* Skip if the SET destination isn't the broadcast source. */ > if (dest != reg) > return nullptr; > > > > + > > > + /* Set the inner scalar to the SET destination. */ > > > + op = dest; > > > + *insn_p = insn; > > > + } > > > + > > > + *scalar_mode_p = mode; > > > + if (CONSTANT_P (op)) > > > + *insn_p = nullptr; > > > + else > > > + *insn_p = insn; > > Is this really needed? Do we have any case with vec_duplicate inner > > Yes. For > > s1 = __extension__(v4si){34, 34, 34, 34}; > > we generate > > (insn 6 2 5 2 (set (reg:SI 99) > (const_int 34 [0x22])) "3.c":35:6 100 {*movsi_internal} > (nil)) > (insn 5 6 7 2 (set (reg:V4SI 98) > (vec_duplicate:V4SI (reg:SI 99))) "3.c":35:6 9240 {*vec_dupv4si} > (expr_list:REG_EQUAL (const_vector:V4SI [ > (const_int 34 [0x22]) repeated x4 > ]) > (nil))) > > We record it as > > (set (reg:V4SI 98) > (vec_duplicate:V4SI (const_int 34 [0x22]))) > > > constant instead of const_vector? Oh, I missed op = SET_SRC (set). Then shouldn't insn_p already be set as NULL by
+ if (CONST_INT_P (op) || CONST_DOUBLE_P (op)) + *insn_p = nullptr; > > > > Others LGTM > > > > > > > > -- > > BR, > > Hongtao > > > > -- > H.J. -- BR, Hongtao