Ping

On Sun, Feb 14, 2021 at 11:27:40AM +0100, Stefan Schulze Frielinghaus wrote:
> On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > This patch adds support for recognizing loops which mimic the behaviour
> > > of function rawmemchr, and replaces those with an internal function call
> > > in case a target provides them.  In contrast to the original rawmemchr
> > > function, this patch also supports different instances where the memory
> > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > respectively.
> > >
> > > This patch is not final and I'm looking for some feedback:
> > >
> > > Previously, only loops which mimic the behaviours of functions memset,
> > > memcpy, and memmove have been detected and replaced by corresponding
> > > function calls.  One characteristic of those loops/partitions is that
> > > they don't have a reduction.  In contrast, loops which mimic the
> > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > My current attempt is to ensure that the reduction statement is not used
> > > in any other partition and only in that case ignore the reduction and
> > > replace the loop by a function call.  We then only need to replace the
> > > reduction variable of the loop which contained the loop result by the
> > > variable of the lhs of the internal function call.  This should ensure
> > > that the transformation is correct independently of how partitions are
> > > fused/distributed in the end.  Any thoughts about this?
> >
> > Currently we're forcing reduction partitions last (and force to have a
> > single one by fusing all partitions containing a reduction) because
> > code-generation does not properly update SSA form for the reduction
> > results.  ISTR that might be just because we do not copy the LC PHI
> > nodes or do not adjust them when copying.
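For concreteness, the loop shape under discussion (a scan with a single load per iteration, no store, and the final pointer as its only reduction) looks like this in plain C++. This is an illustrative sketch, not code from the patch; the three variants differ only in the element type, i.e. in the induction step:

```cpp
#include <cstdint>

// 8-bit scan: one load per iteration, no store, and a single
// reduction, namely the final value of s.  That reduction is the
// loop result whose uses get rewritten to the lhs of the call.
uint8_t *scan8 (uint8_t *s)
{
  while (*s != 0xab)   // constant pattern, as the patch requires
    ++s;
  return s;
}

// 16-bit variant: identical shape, induction step of 2 bytes.
uint16_t *scan16 (uint16_t *s)
{
  while (*s != 0xabcd)
    ++s;
  return s;
}

// 32-bit variant: induction step of 4 bytes.
uint32_t *scan32 (uint32_t *s)
{
  while (*s != 0xabcdef15u)
    ++s;
  return s;
}
```

On a target that implements the corresponding patterns, each of these loops is a candidate to be replaced by a single internal function call.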
> > That might not be an issue in case you replace the partition with a
> > call.  I guess you can try to have a testcase with two rawmemchr
> > patterns and a regular loop part that has to be scheduled in between
> > both for correctness.
>
> Ah ok, in that case I updated my patch by removing the constraint that
> the reduction statement must be in precisely one partition.  Please find
> attached the testcases I came up with so far.  Since transforming a loop
> into a rawmemchr function call is backend dependent, I planned to include
> those only in my backend patch.  I wasn't able to come up with any
> testcase where a loop is distributed into multiple partitions and where
> one is classified as a rawmemchr builtin.  The latter boils down to a
> for loop with an empty body only, in which case I suspect that loop
> distribution shouldn't be done anyway.
>
> > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > builtin_info which I consider rather hacky.  For the long run I thought
> > > about splitting up structure builtin_info into a union where each member
> > > is a structure for a particular builtin of a partition, i.e., something
> > > like this:
> > >
> > > union builtin_info
> > > {
> > >   struct binfo_memset *memset;
> > >   struct binfo_memcpymove *memcpymove;
> > >   struct binfo_rawmemchr *rawmemchr;
> > > };
> > >
> > > Such that a structure for one builtin does not get "polluted" by a
> > > different one.  Any thoughts about this?
> >
> > Probably makes sense if the list of recognized patterns grows further.
> >
> > I see you use internal functions rather than builtin functions.  I guess
> > that's OK.  But you use new target hooks for expansion where I think
> > new optab entries similar to cmpmem would be more appropriate
> > where the distinction between 8, 16 or 32 bits can be encoded in
> > the modes.
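The ordering constraint Richard describes can be sketched as follows. This is a hypothetical illustration, not one of the attached testcases, and it is written as three separate source loops purely to show why the middle part must stay scheduled between the two scans; an actual ldist testcase would need a single loop that the pass splits this way, which the reply notes is hard to construct:

```cpp
#include <cstdint>

// The first scan finds the constant pattern 0x07; the middle loop
// then rewrites the 16 elements after the match; the second scan
// starts past the match and reads the rewritten memory.  Because of
// that read-after-write dependence, the middle part cannot be moved
// before the first scan or after the second one.
uint8_t *two_scans (uint8_t *s)
{
  while (*s != 0x07)              // first rawmemchr-style part
    ++s;
  for (int i = 1; i <= 16; ++i)   // regular part in between
    s[i] ^= 0x55;
  ++s;
  while (*s != 0x07)              // second rawmemchr-style part
    ++s;
  return s;
}
```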
>
> The optab implementation is really nice, which allows me to use iterators
> in the backend, which in the end saves me some boilerplate code compared
> to the previous implementation :)
>
> While using optabs now, I only require one additional member (pattern)
> in the builtin_info struct.  Thus I didn't want to overcomplicate things
> and kept the single struct approach as is.
>
> For the long run, should I resubmit this patch once stage 1 opens, or how
> would you propose to proceed?
>
> Thanks for your review so far!
>
> Cheers,
> Stefan
>
> >
> > Richard.
> >
> > > Cheers,
> > > Stefan
> > > ---
> > >  gcc/internal-fn.c            |  42 ++++++
> > >  gcc/internal-fn.def          |   3 +
> > >  gcc/target-insns.def         |   3 +
> > >  gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
> > >  4 files changed, 272 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > > index dd7173126fb..9cd62544a1a 100644
> > > --- a/gcc/internal-fn.c
> > > +++ b/gcc/internal-fn.c
> > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> > >    gcc_unreachable ();
> > >  }
> > >
> > > +static void
> > > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr8 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX,
> > > +				VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
> > > +    }
> > > +  else
> > > +    gcc_unreachable();
> > > +}
> > > +
> > > +static void
> > > +expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr16 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX,
> > > +				VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn
(targetm.gen_rawmemchr16 (result, start, pattern)); > > > + } > > > + else > > > + gcc_unreachable(); > > > +} > > > + > > > +static void > > > +expand_RAWMEMCHR32 (internal_fn, gcall *stmt) > > > +{ > > > + if (targetm.have_rawmemchr32 ()) > > > + { > > > + rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, > > > VOIDmode, EXPAND_WRITE); > > > + rtx start = expand_normal (gimple_call_arg (stmt, 0)); > > > + rtx pattern = expand_normal (gimple_call_arg (stmt, 1)); > > > + emit_insn (targetm.gen_rawmemchr32 (result, start, pattern)); > > > + } > > > + else > > > + gcc_unreachable(); > > > +} > > > + > > > /* Expand the IFN_UNIQUE function according to its first argument. */ > > > > > > static void > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > > > index daeace7a34e..34247859704 100644 > > > --- a/gcc/internal-fn.def > > > +++ b/gcc/internal-fn.def > > > @@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | > > > ECF_NOTHROW, NULL) > > > DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, > > > NULL) > > > DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL) > > > DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL) > > > +DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL) > > > +DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL) > > > +DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL) > > > > > > /* An unduplicable, uncombinable function. 
Generally used to preserve > > > a CFG property in the face of jump threading, tail merging or > > > diff --git a/gcc/target-insns.def b/gcc/target-insns.def > > > index 672c35698d7..9248554cbf3 100644 > > > --- a/gcc/target-insns.def > > > +++ b/gcc/target-insns.def > > > @@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void)) > > > DEF_TARGET_INSN (unique, (void)) > > > DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2)) > > > DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1)) > > > +DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2)) > > > +DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2)) > > > +DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2)) > > > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c > > > index 7ee19fc8677..f5b24bf53bc 100644 > > > --- a/gcc/tree-loop-distribution.c > > > +++ b/gcc/tree-loop-distribution.c > > > @@ -218,7 +218,7 @@ enum partition_kind { > > > be unnecessary and removed once distributed memset can be > > > understood > > > and analyzed in data reference analysis. See PR82604 for more. > > > */ > > > PKIND_PARTIAL_MEMSET, > > > - PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE > > > + PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR > > > }; > > > > > > /* Type of distributed loop. */ > > > @@ -244,6 +244,8 @@ struct builtin_info > > > is only used in memset builtin distribution for now. */ > > > tree dst_base_base; > > > unsigned HOST_WIDE_INT dst_base_offset; > > > + tree pattern; > > > + internal_fn fn; > > > }; > > > > > > /* Partition for loop distribution. 
*/ > > > @@ -588,7 +590,8 @@ class loop_distribution > > > bool > > > classify_partition (loop_p loop, > > > struct graph *rdg, partition *partition, > > > - bitmap stmt_in_all_partitions); > > > + bitmap stmt_in_all_partitions, > > > + vec<struct partition *> *partitions); > > > > > > > > > /* Returns true when PARTITION1 and PARTITION2 access the same memory > > > @@ -1232,6 +1235,67 @@ generate_memcpy_builtin (class loop *loop, > > > partition *partition) > > > } > > > } > > > > > > +/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP. */ > > > + > > > +static void > > > +generate_rawmemchr_builtin (class loop *loop, partition *partition) > > > +{ > > > + gimple_stmt_iterator gsi; > > > + tree mem, pattern; > > > + struct builtin_info *builtin = partition->builtin; > > > + gimple *fn_call; > > > + > > > + data_reference_p dr = builtin->src_dr; > > > + tree base = builtin->src_base; > > > + > > > + tree result_old = TREE_OPERAND (DR_REF (dr), 0); > > > + tree result_new = copy_ssa_name (result_old); > > > + > > > + /* The new statements will be placed before LOOP. 
*/ > > > + gsi = gsi_last_bb (loop_preheader_edge (loop)->src); > > > + > > > + mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, > > > GSI_CONTINUE_LINKING); > > > + pattern = builtin->pattern; > > > + if (TREE_CODE (pattern) == INTEGER_CST) > > > + pattern = fold_convert (integer_type_node, pattern); > > > + fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern); > > > + gimple_call_set_lhs (fn_call, result_new); > > > + gimple_set_location (fn_call, partition->loc); > > > + gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING); > > > + > > > + imm_use_iterator iter; > > > + gimple *stmt; > > > + use_operand_p use_p; > > > + FOR_EACH_IMM_USE_STMT (stmt, iter, result_old) > > > + { > > > + FOR_EACH_IMM_USE_ON_STMT (use_p, iter) > > > + SET_USE (use_p, result_new); > > > + > > > + update_stmt (stmt); > > > + } > > > + > > > + fold_stmt (&gsi); > > > + > > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > > + switch (builtin->fn) > > > + { > > > + case IFN_RAWMEMCHR8: > > > + fprintf (dump_file, "generated rawmemchr8\n"); > > > + break; > > > + > > > + case IFN_RAWMEMCHR16: > > > + fprintf (dump_file, "generated rawmemchr16\n"); > > > + break; > > > + > > > + case IFN_RAWMEMCHR32: > > > + fprintf (dump_file, "generated rawmemchr32\n"); > > > + break; > > > + > > > + default: > > > + gcc_unreachable (); > > > + } > > > +} > > > + > > > /* Remove and destroy the loop LOOP. 
*/ > > > > > > static void > > > @@ -1334,6 +1398,10 @@ generate_code_for_partition (class loop *loop, > > > generate_memcpy_builtin (loop, partition); > > > break; > > > > > > + case PKIND_RAWMEMCHR: > > > + generate_rawmemchr_builtin (loop, partition); > > > + break; > > > + > > > default: > > > gcc_unreachable (); > > > } > > > @@ -1525,44 +1593,53 @@ find_single_drs (class loop *loop, struct graph > > > *rdg, partition *partition, > > > } > > > } > > > > > > - if (!single_st) > > > + if (!single_ld && !single_st) > > > return false; > > > > > > - /* Bail out if this is a bitfield memory reference. */ > > > - if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > > > - && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > > > - return false; > > > - > > > - /* Data reference must be executed exactly once per iteration of each > > > - loop in the loop nest. We only need to check dominance information > > > - against the outermost one in a perfect loop nest because a bb can't > > > - dominate outermost loop's latch without dominating inner loop's. */ > > > - basic_block bb_st = gimple_bb (DR_STMT (single_st)); > > > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > > > - return false; > > > + basic_block bb_ld = NULL; > > > + basic_block bb_st = NULL; > > > > > > if (single_ld) > > > { > > > - gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > > > - /* Direct aggregate copy or via an SSA name temporary. */ > > > - if (load != store > > > - && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > > > - return false; > > > - > > > /* Bail out if this is a bitfield memory reference. */ > > > if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF > > > && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1))) > > > return false; > > > > > > - /* Load and store must be in the same loop nest. 
*/ > > > - basic_block bb_ld = gimple_bb (DR_STMT (single_ld)); > > > - if (bb_st->loop_father != bb_ld->loop_father) > > > + /* Data reference must be executed exactly once per iteration of > > > each > > > + loop in the loop nest. We only need to check dominance > > > information > > > + against the outermost one in a perfect loop nest because a bb > > > can't > > > + dominate outermost loop's latch without dominating inner loop's. > > > */ > > > + bb_ld = gimple_bb (DR_STMT (single_ld)); > > > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > > > + return false; > > > + } > > > + > > > + if (single_st) > > > + { > > > + /* Bail out if this is a bitfield memory reference. */ > > > + if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > > > + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > > > return false; > > > > > > /* Data reference must be executed exactly once per iteration. > > > - Same as single_st, we only need to check against the outermost > > > + Same as single_ld, we only need to check against the outermost > > > loop. */ > > > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > > > + bb_st = gimple_bb (DR_STMT (single_st)); > > > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > > > + return false; > > > + } > > > + > > > + if (single_ld && single_st) > > > + { > > > + gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > > > + /* Direct aggregate copy or via an SSA name temporary. */ > > > + if (load != store > > > + && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > > > + return false; > > > + > > > + /* Load and store must be in the same loop nest. 
*/ > > > + if (bb_st->loop_father != bb_ld->loop_father) > > > return false; > > > > > > edge e = single_exit (bb_st->loop_father); > > > @@ -1681,6 +1758,84 @@ alloc_builtin (data_reference_p dst_dr, > > > data_reference_p src_dr, > > > return builtin; > > > } > > > > > > +/* Given data reference DR in loop nest LOOP, classify if it forms > > > builtin > > > + rawmemchr{8,16,32} call. */ > > > + > > > +static bool > > > +classify_builtin_rawmemchr (loop_p loop, partition *partition, > > > data_reference_p dr, tree loop_result) > > > +{ > > > + tree dr_ref = DR_REF (dr); > > > + tree dr_access_base = build_fold_addr_expr (dr_ref); > > > + tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref)); > > > + gimple *dr_stmt = DR_STMT (dr); > > > + tree rhs1 = gimple_assign_rhs1 (dr_stmt); > > > + affine_iv iv; > > > + tree pattern; > > > + > > > + if (TREE_OPERAND (rhs1, 0) != loop_result) > > > + return false; > > > + > > > + /* A limitation of the current implementation is that we only support > > > + constant patterns. */ > > > + gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header)); > > > + pattern = gimple_cond_rhs (cond_stmt); > > > + if (gimple_cond_code (cond_stmt) != NE_EXPR > > > + || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt) > > > + || TREE_CODE (pattern) != INTEGER_CST) > > > + return false; > > > + > > > + /* Bail out if no affine induction variable with constant step can be > > > + determined. */ > > > + if (!simple_iv (loop, loop, dr_access_base, &iv, false)) > > > + return false; > > > + > > > + /* Bail out if memory accesses are not consecutive. */ > > > + if (!operand_equal_p (iv.step, dr_access_size, 0)) > > > + return false; > > > + > > > + /* Bail out if direction of memory accesses is not growing. 
*/ > > > + if (get_range_pos_neg (iv.step) != 1) > > > + return false; > > > + > > > + internal_fn fn; > > > + switch (TREE_INT_CST_LOW (iv.step)) > > > + { > > > + case 1: > > > + if (!targetm.have_rawmemchr8 ()) > > > + return false; > > > + fn = IFN_RAWMEMCHR8; > > > + break; > > > + > > > + case 2: > > > + if (!targetm.have_rawmemchr16 ()) > > > + return false; > > > + fn = IFN_RAWMEMCHR16; > > > + break; > > > + > > > + case 4: > > > + if (!targetm.have_rawmemchr32 ()) > > > + return false; > > > + fn = IFN_RAWMEMCHR32; > > > + break; > > > + > > > + default: > > > + return false; > > > + } > > > + > > > + struct builtin_info *builtin; > > > + builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE); > > > + builtin->src_dr = dr; > > > + builtin->src_base = iv.base; > > > + builtin->pattern = pattern; > > > + builtin->fn = fn; > > > + > > > + partition->loc = gimple_location (dr_stmt); > > > + partition->builtin = builtin; > > > + partition->kind = PKIND_RAWMEMCHR; > > > + > > > + return true; > > > +} > > > + > > > /* Given data reference DR in loop nest LOOP, classify if it forms > > > builtin > > > memset call. 
*/ > > > > > > @@ -1792,12 +1947,16 @@ loop_distribution::classify_builtin_ldst (loop_p > > > loop, struct graph *rdg, > > > bool > > > loop_distribution::classify_partition (loop_p loop, > > > struct graph *rdg, partition > > > *partition, > > > - bitmap stmt_in_all_partitions) > > > + bitmap stmt_in_all_partitions, > > > + vec<struct partition *> > > > *partitions) > > > { > > > bitmap_iterator bi; > > > unsigned i; > > > data_reference_p single_ld = NULL, single_st = NULL; > > > bool volatiles_p = false, has_reduction = false; > > > + unsigned nreductions = 0; > > > + gimple *reduction_stmt = NULL; > > > + bool has_interpar_reduction = false; > > > > > > EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi) > > > { > > > @@ -1821,6 +1980,19 @@ loop_distribution::classify_partition (loop_p loop, > > > partition->reduction_p = true; > > > else > > > has_reduction = true; > > > + > > > + /* Determine whether the reduction statement occurs in other > > > + partitions than the current one. */ > > > + struct partition *piter; > > > + for (unsigned j = 0; partitions->iterate (j, &piter); ++j) > > > + { > > > + if (piter == partition) > > > + continue; > > > + if (bitmap_bit_p (piter->stmts, i)) > > > + has_interpar_reduction = true; > > > + } > > > + reduction_stmt = stmt; > > > + ++nreductions; > > > } > > > } > > > > > > @@ -1840,6 +2012,30 @@ loop_distribution::classify_partition (loop_p loop, > > > if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld)) > > > return has_reduction; > > > > > > + /* If we determined a single load and a single reduction statement > > > which does > > > + not occur in any other partition, then try to classify this > > > partition as a > > > + rawmemchr builtin. 
*/ > > > + if (single_ld != NULL > > > + && single_st == NULL > > > + && nreductions == 1 > > > + && !has_interpar_reduction > > > + && is_gimple_assign (reduction_stmt)) > > > + { > > > + /* If we classified the partition as a builtin, then ignoring the > > > single > > > + reduction is safe, since the reduction variable is not used in > > > other > > > + partitions. */ > > > + tree reduction_var = gimple_assign_lhs (reduction_stmt); > > > + return !classify_builtin_rawmemchr (loop, partition, single_ld, > > > reduction_var); > > > + } > > > + > > > + if (single_st == NULL) > > > + return has_reduction; > > > + > > > + /* Don't distribute loop if niters is unknown. */ > > > + tree niters = number_of_latch_executions (loop); > > > + if (niters == NULL_TREE || niters == chrec_dont_know) > > > + return has_reduction; > > > + > > > partition->loc = gimple_location (DR_STMT (single_st)); > > > > > > /* Classify the builtin kind. */ > > > @@ -2979,7 +3175,7 @@ loop_distribution::distribute_loop (class loop > > > *loop, vec<gimple *> stmts, > > > FOR_EACH_VEC_ELT (partitions, i, partition) > > > { > > > reduction_in_all > > > - |= classify_partition (loop, rdg, partition, > > > stmt_in_all_partitions); > > > + |= classify_partition (loop, rdg, partition, > > > stmt_in_all_partitions, &partitions); > > > any_builtin |= partition_builtin_p (partition); > > > } > > > > > > @@ -3290,11 +3486,6 @@ loop_distribution::execute (function *fun) > > > && !optimize_loop_for_speed_p (loop))) > > > continue; > > > > > > - /* Don't distribute loop if niters is unknown. */ > > > - tree niters = number_of_latch_executions (loop); > > > - if (niters == NULL_TREE || niters == chrec_dont_know) > > > - continue; > > > - > > > /* Get the perfect loop nest for distribution. */ > > > loop = prepare_perfect_loop_nest (loop); > > > for (; loop; loop = loop->inner) > > > -- > > > 2.23.0 > > >
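To summarize the bail-outs in classify_builtin_rawmemchr from the patch above, here are loop shapes that do and do not qualify. These are plain C++ sketches written for illustration, not part of the patch; the comments name the check that rejects each loop:

```cpp
#include <cstdint>

// Qualifies: NE_EXPR exit test against an INTEGER_CST pattern, and
// the induction step (4 bytes) equals the access size and grows, so
// the 32-bit (SImode) handler would be queried.
uint32_t *ok (uint32_t *s)
{
  while (*s != 0xdeadbeefu)
    ++s;
  return s;
}

// Rejected: the pattern is a runtime value, so the
// TREE_CODE (pattern) != INTEGER_CST check bails out.
uint32_t *runtime_pattern (uint32_t *s, uint32_t c)
{
  while (*s != c)
    ++s;
  return s;
}

// Rejected: the scan moves backwards, so the check that the
// direction of memory accesses is growing bails out.
uint32_t *backward (uint32_t *s)
{
  while (*s != 0xdeadbeefu)
    --s;
  return s;
}

// Rejected: the stride of 2 elements (8 bytes) is not the access
// size, so the "memory accesses are not consecutive" check bails
// out; 8 also matches none of the step values 1, 2, 4 that select
// the 8-, 16- and 32-bit handlers.
uint32_t *strided (uint32_t *s)
{
  while (*s != 0xdeadbeefu)
    s += 2;
  return s;
}
```

All four loops are valid C++ and behave as written at runtime; only the first has the shape the pass turns into a rawmemchr call.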
> commit bf792239150 > Author: Stefan Schulze Frielinghaus <stefa...@linux.ibm.com> > Date: Mon Feb 8 10:35:30 2021 +0100 > > ldist: Recognize rawmemchr loop patterns > > This patch adds support for recognizing loops which mimic the behaviour > of function rawmemchr, and replaces those with an internal function call > in case a target provides them. In contrast to the original rawmemchr > function, this patch also supports different instances where the memory > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized, > respectively. > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c > index dd7173126fb..18e12b863c6 100644 > --- a/gcc/internal-fn.c > +++ b/gcc/internal-fn.c > @@ -2917,6 +2917,31 @@ expand_VEC_CONVERT (internal_fn, gcall *) > gcc_unreachable (); > } > > +void > +expand_RAWMEMCHR (internal_fn, gcall *stmt) > +{ > + expand_operand ops[3]; > + > + tree lhs = gimple_call_lhs (stmt); > + tree lhs_type = TREE_TYPE (lhs); > + rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE); > + create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type)); > + > + for (unsigned int i = 0; i < 2; ++i) > + { > + tree rhs = gimple_call_arg (stmt, i); > + tree rhs_type = TREE_TYPE (rhs); > + rtx rhs_rtx = expand_normal (rhs); > + create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type)); > + } > + > + insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode); > + > + expand_insn (icode, 3, ops); > + if (!rtx_equal_p (lhs_rtx, ops[0].value)) > + emit_move_insn (lhs_rtx, ops[0].value); > +} > + > /* Expand the IFN_UNIQUE function according to its first argument. 
*/ > > static void > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > index daeace7a34e..95c76795648 100644 > --- a/gcc/internal-fn.def > +++ b/gcc/internal-fn.def > @@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | > ECF_NOTHROW, NULL) > DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL) > DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL) > DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL) > +DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL) > > /* An unduplicable, uncombinable function. Generally used to preserve > a CFG property in the face of jump threading, tail merging or > diff --git a/gcc/optabs.def b/gcc/optabs.def > index b192a9d070b..f7c69f914ce 100644 > --- a/gcc/optabs.def > +++ b/gcc/optabs.def > @@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a") > OPTAB_D (movmem_optab, "movmem$a") > OPTAB_D (setmem_optab, "setmem$a") > OPTAB_D (strlen_optab, "strlen$a") > +OPTAB_D (rawmemchr_optab, "rawmemchr$I$a") > > OPTAB_DC(fma_optab, "fma$a4", FMA) > OPTAB_D (fms_optab, "fms$a4") > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c > index 7ee19fc8677..09f200da61f 100644 > --- a/gcc/tree-loop-distribution.c > +++ b/gcc/tree-loop-distribution.c > @@ -115,6 +115,10 @@ along with GCC; see the file COPYING3. If not see > #include "tree-vectorizer.h" > #include "tree-eh.h" > #include "gimple-fold.h" > +#include "rtl.h" > +#include "memmodel.h" > +#include "insn-codes.h" > +#include "optabs.h" > > > #define MAX_DATAREFS_NUM \ > @@ -218,7 +222,7 @@ enum partition_kind { > be unnecessary and removed once distributed memset can be understood > and analyzed in data reference analysis. See PR82604 for more. */ > PKIND_PARTIAL_MEMSET, > - PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE > + PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR > }; > > /* Type of distributed loop. 
*/ > @@ -244,6 +248,8 @@ struct builtin_info > is only used in memset builtin distribution for now. */ > tree dst_base_base; > unsigned HOST_WIDE_INT dst_base_offset; > + /* Pattern is used only in rawmemchr builtin distribution for now. */ > + tree pattern; > }; > > /* Partition for loop distribution. */ > @@ -1232,6 +1238,66 @@ generate_memcpy_builtin (class loop *loop, partition > *partition) > } > } > > +/* Generate a call to rawmemchr for PARTITION in LOOP. */ > + > +static void > +generate_rawmemchr_builtin (class loop *loop, partition *partition) > +{ > + gimple_stmt_iterator gsi; > + tree mem, pattern; > + struct builtin_info *builtin = partition->builtin; > + gimple *fn_call; > + > + data_reference_p dr = builtin->src_dr; > + tree base = builtin->src_base; > + > + tree result_old = build_fold_addr_expr (DR_REF (dr)); > + tree result_new = copy_ssa_name (result_old); > + > + /* The new statements will be placed before LOOP. */ > + gsi = gsi_last_bb (loop_preheader_edge (loop)->src); > + > + mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, > + GSI_CONTINUE_LINKING); > + pattern = builtin->pattern; > + fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern); > + gimple_call_set_lhs (fn_call, result_new); > + gimple_set_location (fn_call, partition->loc); > + gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING); > + > + imm_use_iterator iter; > + gimple *stmt; > + use_operand_p use_p; > + FOR_EACH_IMM_USE_STMT (stmt, iter, result_old) > + { > + FOR_EACH_IMM_USE_ON_STMT (use_p, iter) > + SET_USE (use_p, result_new); > + > + update_stmt (stmt); > + } > + > + fold_stmt (&gsi); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + switch (TYPE_MODE (TREE_TYPE (pattern))) > + { > + case QImode: > + fprintf (dump_file, "generated rawmemchrqi\n"); > + break; > + > + case HImode: > + fprintf (dump_file, "generated rawmemchrhi\n"); > + break; > + > + case SImode: > + fprintf (dump_file, "generated rawmemchrsi\n"); > + break; > + > 
+ default: > + gcc_unreachable (); > + } > +} > + > /* Remove and destroy the loop LOOP. */ > > static void > @@ -1334,6 +1400,10 @@ generate_code_for_partition (class loop *loop, > generate_memcpy_builtin (loop, partition); > break; > > + case PKIND_RAWMEMCHR: > + generate_rawmemchr_builtin (loop, partition); > + break; > + > default: > gcc_unreachable (); > } > @@ -1525,44 +1595,53 @@ find_single_drs (class loop *loop, struct graph *rdg, > partition *partition, > } > } > > - if (!single_st) > + if (!single_ld && !single_st) > return false; > > - /* Bail out if this is a bitfield memory reference. */ > - if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > - && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > - return false; > - > - /* Data reference must be executed exactly once per iteration of each > - loop in the loop nest. We only need to check dominance information > - against the outermost one in a perfect loop nest because a bb can't > - dominate outermost loop's latch without dominating inner loop's. */ > - basic_block bb_st = gimple_bb (DR_STMT (single_st)); > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > - return false; > + basic_block bb_ld = NULL; > + basic_block bb_st = NULL; > > if (single_ld) > { > - gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > - /* Direct aggregate copy or via an SSA name temporary. */ > - if (load != store > - && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > - return false; > - > /* Bail out if this is a bitfield memory reference. */ > if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF > && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1))) > return false; > > - /* Load and store must be in the same loop nest. */ > - basic_block bb_ld = gimple_bb (DR_STMT (single_ld)); > - if (bb_st->loop_father != bb_ld->loop_father) > + /* Data reference must be executed exactly once per iteration of each > + loop in the loop nest. 
We only need to check dominance information > + against the outermost one in a perfect loop nest because a bb can't > + dominate outermost loop's latch without dominating inner loop's. */ > + bb_ld = gimple_bb (DR_STMT (single_ld)); > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > + return false; > + } > + > + if (single_st) > + { > + /* Bail out if this is a bitfield memory reference. */ > + if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF > + && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1))) > return false; > > /* Data reference must be executed exactly once per iteration. > - Same as single_st, we only need to check against the outermost > + Same as single_ld, we only need to check against the outermost > loop. */ > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld)) > + bb_st = gimple_bb (DR_STMT (single_st)); > + if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st)) > + return false; > + } > + > + if (single_ld && single_st) > + { > + gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld); > + /* Direct aggregate copy or via an SSA name temporary. */ > + if (load != store > + && gimple_assign_lhs (load) != gimple_assign_rhs1 (store)) > + return false; > + > + /* Load and store must be in the same loop nest. */ > + if (bb_st->loop_father != bb_ld->loop_father) > return false; > > edge e = single_exit (bb_st->loop_father); > @@ -1681,6 +1760,68 @@ alloc_builtin (data_reference_p dst_dr, > data_reference_p src_dr, > return builtin; > } > > +/* Given data reference DR in loop nest LOOP, classify if it forms builtin > + rawmemchr call. 
>    */
> +
> +static bool
> +classify_builtin_rawmemchr (loop_p loop, partition *partition,
> +			    data_reference_p dr, tree loop_result)
> +{
> +  tree dr_ref = DR_REF (dr);
> +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> +  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> +  gimple *dr_stmt = DR_STMT (dr);
> +  affine_iv iv;
> +  tree pattern;
> +
> +  if (dr_access_base != loop_result)
> +    return false;
> +
> +  /* A limitation of the current implementation is that we only support
> +     constant patterns.  */
> +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> +  pattern = gimple_cond_rhs (cond_stmt);
> +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> +      || TREE_CODE (pattern) != INTEGER_CST)
> +    return false;
> +
> +  /* Bail out if no affine induction variable with constant step can be
> +     determined.  */
> +  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> +    return false;
> +
> +  /* Bail out if memory accesses are not consecutive.  */
> +  if (!operand_equal_p (iv.step, dr_access_size, 0))
> +    return false;
> +
> +  /* Bail out if direction of memory accesses is not growing.  */
> +  if (get_range_pos_neg (iv.step) != 1)
> +    return false;
> +
> +  /* Bail out if target does not provide rawmemchr for a certain mode.  */
> +  machine_mode mode;
> +  switch (TREE_INT_CST_LOW (iv.step))
> +    {
> +    case 1: mode = QImode; break;
> +    case 2: mode = HImode; break;
> +    case 4: mode = SImode; break;
> +    default: return false;
> +    }
> +  if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
> +    return false;
> +
> +  struct builtin_info *builtin;
> +  builtin = alloc_builtin (NULL, dr, NULL_TREE, iv.base, NULL_TREE);
> +  builtin->pattern = pattern;
> +
> +  partition->loc = gimple_location (dr_stmt);
> +  partition->builtin = builtin;
> +  partition->kind = PKIND_RAWMEMCHR;
> +
> +  return true;
> +}
> +
>  /* Given data reference DR in loop nest LOOP, classify if it forms builtin
>     memset call.  */
> 
> @@ -1798,6 +1939,8 @@ loop_distribution::classify_partition (loop_p loop,
>    unsigned i;
>    data_reference_p single_ld = NULL, single_st = NULL;
>    bool volatiles_p = false, has_reduction = false;
> +  unsigned nreductions = 0;
> +  gimple *reduction_stmt = NULL;
> 
>    EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
>      {
> @@ -1821,6 +1964,10 @@ loop_distribution::classify_partition (loop_p loop,
>  	    partition->reduction_p = true;
>  	  else
>  	    has_reduction = true;
> +
> +	  /* Determine whether STMT is the only reduction statement or not.  */
> +	  reduction_stmt = stmt;
> +	  ++nreductions;
>  	}
>      }
> 
> @@ -1840,6 +1987,27 @@ loop_distribution::classify_partition (loop_p loop,
>    if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
>      return has_reduction;
> 
> +  /* If we determined a single load and a single reduction statement, then try
> +     to classify this partition as a rawmemchr builtin.  */
> +  if (single_ld != NULL
> +      && single_st == NULL
> +      && nreductions == 1
> +      && is_gimple_assign (reduction_stmt))
> +    {
> +      /* If we classified the partition as a builtin, then ignoring the single
> +	 reduction is safe, since the whole partition is replaced by a call.  */
> +      tree reduction_var = gimple_assign_lhs (reduction_stmt);
> +      return !classify_builtin_rawmemchr (loop, partition, single_ld,
> +					  reduction_var);
> +    }
> +
> +  if (single_st == NULL)
> +    return has_reduction;
> +
> +  /* Don't distribute loop if niters is unknown.  */
> +  tree niters = number_of_latch_executions (loop);
> +  if (niters == NULL_TREE || niters == chrec_dont_know)
> +    return has_reduction;
> +
>    partition->loc = gimple_location (DR_STMT (single_st));
> 
>    /* Classify the builtin kind.  */
> @@ -3290,11 +3458,6 @@ loop_distribution::execute (function *fun)
>  	      && !optimize_loop_for_speed_p (loop)))
>  	continue;
> 
> -      /* Don't distribute loop if niters is unknown.  */
> -      tree niters = number_of_latch_executions (loop);
> -      if (niters == NULL_TREE || niters == chrec_dont_know)
> -	continue;
> -
>        /* Get the perfect loop nest for distribution.  */
>        loop = prepare_perfect_loop_nest (loop);
>        for (; loop; loop = loop->inner)
> 
> /* { dg-do compile } */
> /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
> 
> #include <cstdint>
> 
> template<typename T, T pattern>
> T *rawmemchr(T *s)
> {
>   while (*s != pattern)
>     ++s;
>   return s;
> }
> 
> template uint8_t *rawmemchr<uint8_t, 0xab>(uint8_t *);
> template uint16_t *rawmemchr<uint16_t, 0xabcd>(uint16_t *);
> template uint32_t *rawmemchr<uint32_t, 0xabcdef15>(uint32_t *);
> 
> template int8_t *rawmemchr<int8_t, (int8_t)0xab>(int8_t *);
> template int16_t *rawmemchr<int16_t, (int16_t)0xabcd>(int16_t *);
> template int32_t *rawmemchr<int32_t, (int32_t)0xabcdef15>(int32_t *);
> 
> /* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */
> 
> /* { dg-do run } */
> /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details -mzarch -march=z13" } */
> 
> #include <cstring>
> #include <cassert>
> #include <cstdint>
> #include <iostream>
> 
> template <typename T, T pattern>
> __attribute__((noinline,noclone))
> T* rawmemchr (T *s)
> {
>   while (*s != pattern)
>     ++s;
>   return s;
> }
> 
> template <typename T, T pattern>
> void doit()
> {
>   T *buf = new T[4096 * 2];
>   assert (buf != NULL);
>   memset (buf, 0xa, 4096 * 2 * sizeof (T));
>   // ensure q is 4096-byte aligned by rounding the byte address up
>   T *q = (T *)(((uintptr_t)buf + 4096) & ~(uintptr_t)4095);
>   T *p;
> 
>   // unaligned + block boundary + 1st load
>   p = (T *) ((uintptr_t)q - 8);
>   p[2] = pattern;
>   assert ((rawmemchr<T, pattern> (&p[0]) == &p[2]));
>   p[2] = (T) 0xaaaaaaaa;
> 
>   // unaligned + block boundary + 2nd load
>   p = (T *) ((uintptr_t)q - 8);
>   p[6] = pattern;
>   assert ((rawmemchr<T, pattern> (&p[0]) == &p[6]));
>   p[6] = (T) 0xaaaaaaaa;
> 
>   // unaligned + 1st load
>   q[5] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[5]));
>   q[5] = (T) 0xaaaaaaaa;
> 
>   // unaligned + 2nd load
>   q[14] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[14]));
>   q[14] = (T) 0xaaaaaaaa;
> 
>   // unaligned + 3rd load
>   q[19] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[19]));
>   q[19] = (T) 0xaaaaaaaa;
> 
>   // unaligned + 4th load
>   q[25] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[25]));
>   q[25] = (T) 0xaaaaaaaa;
> 
>   // aligned + 1st load
>   q[5] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[5]));
>   q[5] = (T) 0xaaaaaaaa;
> 
>   // aligned + 2nd load
>   q[14] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[14]));
>   q[14] = (T) 0xaaaaaaaa;
> 
>   // aligned + 3rd load
>   q[19] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[19]));
>   q[19] = (T) 0xaaaaaaaa;
> 
>   // aligned + 4th load
>   q[25] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[25]));
>   q[25] = (T) 0xaaaaaaaa;
> 
>   delete[] buf;
> }
> 
> int main(void)
> {
>   doit<uint8_t, 0xde> ();
>   doit<int8_t, (int8_t)0xde> ();
>   doit<uint16_t, 0xdead> ();
>   doit<int16_t, (int16_t)0xdead> ();
>   doit<uint32_t, 0xdeadbeef> ();
>   doit<int32_t, (int32_t)0xdeadbeef> ();
>   return 0;
> }
> 
> /* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */