https://gcc.gnu.org/g:65a3849eb46df2fbac6b41ff78dae13c85387f9e
commit r16-5780-g65a3849eb46df2fbac6b41ff78dae13c85387f9e
Author: Tamar Christina <[email protected]>
Date:   Sun Nov 30 07:29:50 2025 +0000

    vect: support vectorization of early break forced live IVs as scalar

    Consider this simple loop

      long long arr[1024];

      long long *f()
      {
        int i;
        for (i = 0; i < 1024; i++)
          if (arr[i] == 42)
            break;
        return arr + i;
      }

    where today we generate this at -O3:

      .L2:
              add     v29.4s, v29.4s, v25.4s
              add     v28.4s, v28.4s, v26.4s
              cmp     x2, x1
              beq     .L9
      .L6:
              ldp     q30, q31, [x1], 32
              cmeq    v30.2d, v30.2d, v27.2d
              cmeq    v31.2d, v31.2d, v27.2d
              addhn   v31.2s, v31.2d, v30.2d
              fmov    x3, d31
              cbz     x3, .L2

    which is highly inefficient.  This loop has 3 IVs (PR119577): one
    normal scalar one and two vector ones, one counting up and one counting
    down (PR115120).  It also gets a forced unrolling due to an increase in
    VF caused by the mismatch in modes between the IVs and the loop body
    (PR119860).

    This patch fixes all three of these issues and we now generate:

      .L2:
              add     w2, w2, 2
              cmp     w2, 1024
              beq     .L13
      .L5:
              ldr     q31, [x1]
              add     x1, x1, 16
              cmeq    v31.2d, v31.2d, v30.2d
              umaxp   v31.4s, v31.4s, v31.4s
              fmov    x0, d31
              cbz     x0, .L2

    or with SVE:

      .L3:
              add     x1, x1, x3
              whilelo p7.d, w1, w2
              b.none  .L11
      .L4:
              ld1d    z30.d, p7/z, [x0, x1, lsl 3]
              cmpeq   p7.d, p7/z, z30.d, z31.d
              b.none  .L3

    which shows that the new scalar IV is efficiently merged with the loop
    control IV by IVopts.

    To accomplish this the patch reworks how we handle "forced live
    inductions" with regard to vectorization.  Prior to this change, when
    we vectorize a loop with an early break, any induction variables would
    be forced live.  Forcing them live means that even though the values
    aren't used inside the loop, we must preserve them so that when we
    start the scalar loop we can pass the correct initial values.

    However this had several side-effects:

    1. We must be able to vectorize the induction.

    2. The induction variable participates in VF determination.  This
       often led to a higher VF than would normally have been needed,
       making the vector loop less profitable.

    3. IVcanon, for loops with a constant iteration count, inserts a
       downward-counting IV in addition to the upward-counting one in
       order to support things like doloops.  Normally this duplicate IV
       is removed by IVopts, but IVopts doesn't understand vector
       inductions.  As such we end up with 3 IVs.

    This patch fixes all three of these by instead creating a new scalar
    IV that's adjusted within the loop and updating all the IV statements
    outside the loop to use this new IV.  We re-use
    vect_update_ivs_after_vectorizer for all exits now and put in a dummy
    value representing the IV that is to be generated later.  To do this
    we delay the call to vect_update_ivs_after_vectorizer until after the
    skip_epilogue edge is created, and vect_update_ivs_after_vectorizer
    now updates all out-of-loop usages of IVs, not just those on the merge
    edge to the scalar loop.  This not only generates better code, but
    removes the need to fix up the "forced live" scalar IVs later on.

    This new scalar IV is then materialized in
    vect_update_ivs_after_vectorizer_for_early_breaks.  When peeling for
    alignment (PFA) using masks by skipping iterations, we now roll the
    PFA IV up into the new scalar IV by starting the first iteration back
    at start - niters_peel and then taking MAX <scal_iv, 0> to correctly
    handle the first iteration.
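    To illustrate, the overall shape of the transformed code corresponds
    roughly to the following hand-written C sketch.  VF, scal_iv, f_vect
    and the explicit scalar epilogue are expository stand-ins, not the
    actual generated GIMPLE:

      enum { VF = 2 };          /* assumed vectorization factor.  */
      long long arr[1024];

      long long *f_vect (void)
      {
        /* The single new scalar IV.  With masked PFA it would instead
           start at 0 - niters_peel.  */
        long long scal_iv = 0;

        for (;;)
          {
            /* Vector body: test VF elements at once; this inner loop
               stands in for the vector compare.  */
            int early_exit = 0;
            for (int l = 0; l < VF; l++)
              if (arr[scal_iv + l] == 42)
                early_exit = 1;
            if (early_exit)
              break;

            scal_iv += VF;              /* the only in-loop IV update.  */
            if (scal_iv == 1024)        /* main (latch) exit.  */
              return arr + 1024;
          }

        /* Early exit: with masked PFA clamp with MAX <scal_iv, 0>, then
           restart the scalar loop from there to find the exact element.  */
        long long i = scal_iv > 0 ? scal_iv : 0;
        for (; i < 1024; i++)
          if (arr[i] == 42)
            break;
        return arr + i;
      }

    The scalar epilogue then re-runs at most VF scalar iterations, and
    IVopts is free to merge scal_iv with the loop-control IV as in the
    assembly above.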
    Because we are now re-using vect_update_ivs_after_vectorizer, we have
    an issue with UB clamping on non-linear inductions.  At the moment,
    when doing the early exit updating, I just ignore the possibility of
    UB, since if the main exit is OK the early exit is one iteration
    behind the main one and so should be OK too.  Things however get
    complicated with PEELED loops.

    gcc/ChangeLog:

            PR tree-optimization/115120
            PR tree-optimization/119577
            PR tree-optimization/119860
            * tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for
            nonlinear mult induction and early break.
            (vect_update_ivs_after_vectorizer): Support early break exits.
            (vect_do_peeling): Support scalar IVs.
            * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support
            early break.
            (vect_update_nonlinear_iv): Use `unsigned_type_for` so that
            the function works for both vector and scalar types.
            (vectorizable_induction, vectorizable_live_operation): Remove
            vector early break IV code.
            (vect_update_ivs_after_vectorizer_for_early_breaks): New.
            (vect_transform_loop): Support new scalar IV for early break.
            * tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for
            early break IVs.
            * tree-vect-stmts.cc (vect_stmt_relevant_p): Mark early break
            IVs as completely unused rather than used_only_live.  They no
            longer contribute to the vector loop and so should not be
            analyzed.
            (can_vectorize_live_stmts): Remove vector early break IV code.
            * tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
            (class loop_vec_info): Add early_break_niters_var.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/115120
            PR tree-optimization/119577
            PR tree-optimization/119860
            * gcc.dg/vect/vect-early-break_39.c: Update.
            * gcc.dg/vect/vect-early-break_139.c: New testcase.
            * gcc.target/aarch64/sve/peel_ind_10.c: Update.
            * gcc.target/aarch64/sve/peel_ind_11.c: Update.
            * gcc.target/aarch64/sve/peel_ind_12.c: Update.
            * gcc.target/aarch64/sve/peel_ind_5.c: Update.
            * gcc.target/aarch64/sve/peel_ind_6.c: Update.
            * gcc.target/aarch64/sve/peel_ind_7.c: Update.
            * gcc.target/aarch64/sve/peel_ind_9.c: Update.
            * gcc.target/aarch64/sve/pr119351.c: Update.
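    As a reference for the vect_peel_nonlinear_iv_init change above, a
    neg induction combined with an early break has the following shape
    (a hand-written illustration, not one of the testcases in this
    commit):

      /* b = -b is a vect_step_op_neg induction: b only ever alternates
         between its two values, and on an early break the epilogue
         resumes at the start of the breaking vector iteration, so the
         initial value can be left unadjusted when early_exit_p is set.  */
      int loop_neg (int *a, int n, int b)
      {
        for (int i = 0; i < n; i++)
          {
            if (a[i] == 0)      /* early break.  */
              return b;
            b = -b;
          }
        return b;
      }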
Diff:
---
 gcc/testsuite/gcc.dg/vect/vect-early-break_139.c   |  37 ++++
 gcc/testsuite/gcc.dg/vect/vect-early-break_39.c    |   5 +-
 gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c |   1 -
 gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c |   1 -
 gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c |   1 -
 gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c  |   1 -
 gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c  |   1 -
 gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c  |   1 -
 gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c  |   4 +-
 gcc/testsuite/gcc.target/aarch64/sve/pr119351.c    |   1 -
 gcc/tree-vect-loop-manip.cc                        |  77 +++++--
 gcc/tree-vect-loop.cc                              | 238 ++++++++++++---------
 gcc/tree-vect-slp.cc                               |  42 ----
 gcc/tree-vect-stmts.cc                             |  28 +--
 gcc/tree-vectorizer.h                              |   8 +-
 15 files changed, 246 insertions(+), 200 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_139.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_139.c
new file mode 100644
index 000000000000..95994936c432
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_139.c
@@ -0,0 +1,37 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+__attribute__((noipa))
+unsigned loop9(unsigned char *a, unsigned n, unsigned c)
+{
+  for (unsigned j = 0;;)
+    {
+      if (c <= j)
+        __builtin_abort();
+
+      unsigned char *slot = (unsigned char *)a + j;
+
+      *slot = (char)j;
+
+      unsigned d = j + 1;
+      if (d < n)
+        j = d;
+      else
+        return d;
+    }
+}
+
+int main ()
+{
+  check_vect ();
+
+  unsigned char buff[16] = {0};
+  unsigned res = loop9 (buff, 16, 20);
+  if (res != 16)
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
index b3f40b8c9ba4..bc862ad20e68 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
@@ -23,5 +23,6 @@ unsigned test4(unsigned x, unsigned n)
   return ret;
 }
 
-/* cannot safely vectorize this due due to the group misalignment.  */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0 "vect" } } */
+/* AArch64 will scalarize the load and is able to vectorize it.  */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1 "vect" { target aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0 "vect" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
index b7a7bc5cb0cf..43abd01c078d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
@@ -20,5 +20,4 @@ foo (int start)
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
 /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
index feb7ee7d61c9..37806adea7b9 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
@@ -15,6 +15,5 @@ foo (int *a) {
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
 /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
 /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
index 260482a94df7..e3ed63afb05c 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
@@ -15,7 +15,6 @@ foo (int *restrict a, int * restrict b) {
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
 /* { dg-final { scan-tree-dump "Both peeling and versioning will be applied" "vect" } } */
 /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
 /* { dg-final { scan-assembler {\teor\t.*\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
index a03bb1dec21e..1977bf3af2db 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
@@ -20,5 +20,4 @@ foo (void)
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
 /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
index 9bfd1a65c4fe..0b40d26ae2a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
@@ -20,5 +20,4 @@ foo (int start)
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
 /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
index 0182e131a173..7a24d689e95a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
@@ -20,5 +20,4 @@ foo (void)
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
 /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
index cc904e88170f..136d18c2ea89 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
@@ -20,6 +20,6 @@ foo (void)
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* Peels using a scalar loop.  */
-/* { dg-final { scan-tree-dump-not "pfa_iv_offset" "vect" } } */
+/* Peels using fully masked loop.  */
+/* { dg-final { scan-tree-dump "misalignment for fully-masked loop" "vect" } } */
 /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
index 1ebc735a82f4..1aca6c7de1d4 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
@@ -33,6 +33,5 @@ foo (void)
 }
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
-/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
 /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
 
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 9ddf9acf2f19..43847c4c3fbd 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -2161,6 +2161,16 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
       return false;
     }
 
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+      && induction_type == vect_step_op_mul)
+    {
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                         "Peeling is not supported for nonlinear mult"
+                         " induction using partial vectorization.\n");
+      return false;
+    }
+
   /* Avoid compile time hog on vect_peel_nonlinear_iv_init.  */
   if (induction_type == vect_step_op_mul)
     {
@@ -2315,6 +2325,9 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
    The phi args associated with the edge UPDATE_E in the bb
    UPDATE_E->dest are updated accordingly.
 
+   - EARLY_EXIT_P - Indicates whether the exit is an early exit rather than
+                    the main latch exit.
+
    Assumption 1: Like the rest of the vectorizer, this function assumes
    a single loop exit that has a single predecessor.
@@ -2333,7 +2346,8 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
 
 static void
 vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
-                                  tree niters, edge update_e)
+                                  tree niters, edge update_e,
+                                  bool early_exit_p)
 {
   gphi_iterator gsi, gsi1;
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
@@ -2400,15 +2414,16 @@ vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
       else
        ni = vect_peel_nonlinear_iv_init (&stmts, init_expr,
                                          niters, step_expr,
-                                         induction_type);
+                                         induction_type, early_exit_p);
 
       var = create_tmp_var (type, "tmp");
 
       gimple_seq new_stmts = NULL;
       ni_name = force_gimple_operand (ni, &new_stmts, false, var);
 
-      /* Exit_bb shouldn't be empty.  */
-      if (!gsi_end_p (last_gsi))
+      /* Exit_bb shouldn't be empty, but we also can't insert after a ctrl
+         statement.  */
+      if (!gsi_end_p (last_gsi) && !is_ctrl_stmt (gsi_stmt (last_gsi)))
        {
          gsi_insert_seq_after (&last_gsi, stmts, GSI_SAME_STMT);
          gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
@@ -2419,8 +2434,15 @@ vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
          gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
        }
 
-      /* Fix phi expressions in the successor bb.  */
-      adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
+      /* Fix phi expressions in all out-of-loop bbs.  */
+      imm_use_iterator imm_iter;
+      gimple *use_stmt;
+      use_operand_p use_p;
+      tree ic_var = PHI_ARG_DEF_FROM_EDGE (phi1, update_e);
+      FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, ic_var)
+        if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
+          FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
+            SET_USE (use_p, ni_name);
     }
 }
@@ -3562,14 +3584,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     update_e = single_succ_edge (LOOP_VINFO_IV_EXIT (loop_vinfo)->dest);
 
-  /* If we have a peeled vector iteration, all exits are the same, leave it
-     and so the main exit needs to be treated the same as the alternative
-     exits in that we leave their updates to vectorizable_live_operations.
-  */
-  if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
-    vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
-                                      update_e);
-
   /* If we have a peeled vector iteration we will never skip the epilog loop
      and we can simplify the cfg a lot by not doing the edge split.  */
   if (skip_epilog
@@ -3625,6 +3639,41 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
          scale_loop_profile (epilog, prob_epilog, -1);
        }
 
+  /* If we have a peeled vector iteration, all exits are the same, leave it
+     and so the main exit needs to be treated the same as the alternative
+     exits in that we leave their updates to vectorizable_live_operations.
+  */
+  tree vector_iters_vf = niters_vector_mult_vf;
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    {
+      tree scal_iv_ty = signed_type_for (TREE_TYPE (vector_iters_vf));
+      tree tmp_niters_vf = make_ssa_name (scal_iv_ty);
+      basic_block exit_bb = NULL;
+      edge update_e = NULL;
+
+      /* Identify the early exit merge block.  I wish we had stored this.  */
+      for (auto e : get_loop_exit_edges (loop))
+        if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
+          {
+            exit_bb = e->dest;
+            update_e = single_succ_edge (exit_bb);
+            break;
+          }
+
+      vect_update_ivs_after_vectorizer (loop_vinfo, tmp_niters_vf,
+                                        update_e, true);
+
+      if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
+        vector_iters_vf = tmp_niters_vf;
+
+      LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo) = tmp_niters_vf;
+    }
+
+  bool recalculate_peel_niters_init
+    = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
+  vect_update_ivs_after_vectorizer (loop_vinfo, vector_iters_vf,
+                                    update_e,
+                                    recalculate_peel_niters_init);
+
   /* Recalculate the dominators after adding the guard edge.  */
   if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     iterate_fix_dominators (CDI_DOMINATORS, doms, false);
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index ab6c0f084703..3ac264f0ce32 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8951,14 +8951,25 @@ vect_create_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
 tree
 vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
                              tree skip_niters, tree step_expr,
-                             enum vect_induction_op_type induction_type)
+                             enum vect_induction_op_type induction_type,
+                             bool early_exit_p)
 {
-  gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
+  gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST || early_exit_p);
   tree type = TREE_TYPE (init_expr);
   unsigned prec = TYPE_PRECISION (type);
   switch (induction_type)
     {
+    /* neg inductions are typically not used for loop termination conditions
+       but are typically implemented as b = -b.  That is, every scalar
+       iteration b is negated.  That means that for the initial value of b we
+       will have to determine whether the number of skipped iterations is a
+       multiple of 2, because every 2 scalar iterations we are back at "b".  */
    case vect_step_op_neg:
+      /* For early exits the neg induction will always be the same value at
+         the start of the iteration.  */
+      if (early_exit_p)
+        break;
+
      if (TREE_INT_CST_LOW (skip_niters) % 2)
        init_expr = gimple_build (stmts, NEGATE_EXPR, type, init_expr);
      /* else no change.  */
@@ -8966,13 +8977,15 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
 
     case vect_step_op_shr:
     case vect_step_op_shl:
-      skip_niters = gimple_convert (stmts, type, skip_niters);
-      step_expr = gimple_build (stmts, MULT_EXPR, type, step_expr, skip_niters);
+      skip_niters = fold_build1 (NOP_EXPR, type, skip_niters);
+      step_expr = fold_build1 (NOP_EXPR, type, step_expr);
+      step_expr = fold_build2 (MULT_EXPR, type, step_expr, skip_niters);
       /* When shift mount >= precision, need to avoid UD.
          In the original loop, there's no UD, and according to semantic,
         init_expr should be 0 for lshr, ashl, and >>= (prec - 1) for ashr.  */
-      if (!tree_fits_uhwi_p (step_expr)
+      if ((!tree_fits_uhwi_p (step_expr)
           || tree_to_uhwi (step_expr) >= prec)
+          && !early_exit_p)
        {
          if (induction_type == vect_step_op_shl
              || TYPE_UNSIGNED (type))
@@ -8983,13 +8996,19 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
                            wide_int_to_tree (type, prec - 1));
        }
       else
-       init_expr = gimple_build (stmts, (induction_type == vect_step_op_shr
+       {
+         init_expr = fold_build2 ((induction_type == vect_step_op_shr
                                          ? RSHIFT_EXPR : LSHIFT_EXPR),
-                                 type, init_expr, step_expr);
+                                  type, init_expr, step_expr);
+         init_expr = force_gimple_operand (init_expr, stmts, false, NULL);
+       }
       break;
 
     case vect_step_op_mul:
       {
+       /* Due to UB we can't support vect_step_op_mul with early break for
+          now, so assert and block.  */
+       gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
        tree utype = unsigned_type_for (type);
        init_expr = gimple_convert (stmts, utype, init_expr);
        wide_int skipn = wi::to_wide (skip_niters);
@@ -9073,9 +9092,7 @@ vect_update_nonlinear_iv (gimple_seq* stmts, tree vectype,
     case vect_step_op_mul:
       {
        /* Use unsigned mult to avoid UD integer overflow.  */
-       tree uvectype
-         = build_vector_type (unsigned_type_for (TREE_TYPE (vectype)),
-                              TYPE_VECTOR_SUBPARTS (vectype));
+       tree uvectype = unsigned_type_for (vectype);
        vec_def = gimple_convert (stmts, uvectype, vec_def);
        vec_step = gimple_convert (stmts, uvectype, vec_step);
        vec_def = gimple_build (stmts, MULT_EXPR, uvectype,
@@ -9322,7 +9339,7 @@ vectorizable_nonlinear_induction (loop_vec_info loop_vinfo,
      to adjust the start value here.  */
   if (niters_skip != NULL_TREE)
     init_expr = vect_peel_nonlinear_iv_init (&stmts, init_expr, niters_skip,
-                                            step_expr, induction_type);
+                                            step_expr, induction_type, false);
 
   vec_init = vect_create_nonlinear_iv_init (&stmts, init_expr,
                                            step_expr, nunits, vectype,
@@ -9703,53 +9720,6 @@ vectorizable_induction (loop_vec_info loop_vinfo,
                                           LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo));
          peel_mul = gimple_build_vector_from_val (&init_stmts,
                                                   step_vectype, peel_mul);
-
-         /* If early break then we have to create a new PHI which we can use as
-            an offset to adjust the induction reduction in early exits.
-
-            This is because when peeling for alignment using masking, the first
-            few elements of the vector can be inactive.  As such if we find the
-            entry in the first iteration we have adjust the starting point of
-            the scalar code.
-
-            We do this by creating a new scalar PHI that keeps track of whether
-            we are the first iteration of the loop (with the additional masking)
-            or whether we have taken a loop iteration already.
-
-            The generated sequence:
-
-            pre-header:
-              bb1:
-                i_1 = <number of leading inactive elements>
-
-            header:
-              bb2:
-                i_2 = PHI <i_1(bb1), 0(latch)>
-                …
-
-            early-exit:
-              bb3:
-                i_3 = iv_step * i_2 + PHI<vector-iv>
-
-            The first part of the adjustment to create i_1 and i_2 are done here
-            and the last part creating i_3 is done in
-            vectorizable_live_operations when the induction extraction is
-            materialized.  */
-         if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
-             && !LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
-           {
-             auto skip_niters = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
-             tree ty_skip_niters = TREE_TYPE (skip_niters);
-             tree break_lhs_phi = vect_get_new_vect_var (ty_skip_niters,
-                                                         vect_scalar_var,
-                                                         "pfa_iv_offset");
-             gphi *nphi = create_phi_node (break_lhs_phi, bb);
-             add_phi_arg (nphi, skip_niters, pe, UNKNOWN_LOCATION);
-             add_phi_arg (nphi, build_zero_cst (ty_skip_niters),
-                          loop_latch_edge (iv_loop), UNKNOWN_LOCATION);
-
-             LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo) = PHI_RESULT (nphi);
-           }
        }
   tree step_mul = NULL_TREE;
   unsigned ivn;
@@ -10325,8 +10295,7 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
             to the latch then we're restarting the iteration in the
             scalar loop.  So get the first live value.  */
          bool early_break_first_element_p
-           = (all_exits_as_early_p || !main_exit_edge)
-             && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def;
+           = all_exits_as_early_p || !main_exit_edge;
          if (early_break_first_element_p)
            {
              tmp_vec_lhs = vec_lhs0;
@@ -10335,52 +10304,13 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
 
          gimple_stmt_iterator exit_gsi;
          tree new_tree
-           = vectorizable_live_operation_1 (loop_vinfo,
-                                            e->dest, vectype,
-                                            slp_node, bitsize,
-                                            tmp_bitstart, tmp_vec_lhs,
-                                            lhs_type, &exit_gsi);
+           = vectorizable_live_operation_1 (loop_vinfo,
+                                             e->dest, vectype,
+                                             slp_node, bitsize,
+                                             tmp_bitstart, tmp_vec_lhs,
+                                             lhs_type, &exit_gsi);
 
          auto gsi = gsi_for_stmt (use_stmt);
-         if (early_break_first_element_p
-             && LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
-           {
-             tree step_expr
-               = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
-             tree break_lhs_phi
-               = LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo);
-             tree ty_skip_niters = TREE_TYPE (break_lhs_phi);
-             gimple_seq iv_stmts = NULL;
-
-             /* Now create the PHI for the outside loop usage to
-                retrieve the value for the offset counter.  */
-             tree rphi_step
-               = gimple_convert (&iv_stmts, ty_skip_niters, step_expr);
-             tree tmp2
-               = gimple_build (&iv_stmts, MULT_EXPR,
-                               ty_skip_niters, rphi_step,
-                               break_lhs_phi);
-
-             if (POINTER_TYPE_P (TREE_TYPE (new_tree)))
-               {
-                 tmp2 = gimple_convert (&iv_stmts, sizetype, tmp2);
-                 tmp2 = gimple_build (&iv_stmts, POINTER_PLUS_EXPR,
-                                      TREE_TYPE (new_tree), new_tree,
-                                      tmp2);
-               }
-             else
-               {
-                 tmp2 = gimple_convert (&iv_stmts, TREE_TYPE (new_tree),
-                                        tmp2);
-                 tmp2 = gimple_build (&iv_stmts, PLUS_EXPR,
-                                      TREE_TYPE (new_tree), new_tree,
-                                      tmp2);
-               }
-
-             new_tree = tmp2;
-             gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
-           }
-
          tree lhs_phi = gimple_phi_result (use_stmt);
          remove_phi_node (&gsi, false);
          gimple *copy = gimple_build_assign (lhs_phi, new_tree);
@@ -11021,6 +10951,101 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
       SET_PHI_ARG_DEF_ON_EDGE (phi, e, last_seen_vuse);
 }
 
+/* Generate adjustment code for early break scalar IVs, filling in the value
+   we created earlier for LOOP_VINFO_EARLY_BRK_NITERS_VAR.  */
+
+static void
+vect_update_ivs_after_vectorizer_for_early_breaks (loop_vec_info loop_vinfo)
+{
+  DUMP_VECT_SCOPE ("vect_update_ivs_after_vectorizer_for_early_breaks");
+
+  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    return;
+
+  gcc_assert (LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo));
+
+  tree phi_var = LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo);
+  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree ty_var = TREE_TYPE (phi_var);
+  auto loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree induc_var = niters_skip ? copy_ssa_name (phi_var) : phi_var;
+
+  auto induction_phi = create_phi_node (induc_var, loop->header);
+  tree induc_def = PHI_RESULT (induction_phi);
+
+  /* Create the IV update inside the loop.  */
+  gimple_seq init_stmts = NULL;
+  gimple_seq stmts = NULL;
+  gimple_seq iv_stmts = NULL;
+  tree tree_vf = build_int_cst (ty_var, vf);
+
+  /* For loop len targets we have to use .SELECT_VL (ivtmp_33, VF); instead
+     of just += VF as the VF can change in between two loop iterations.  */
+  if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
+    {
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      tree_vf = vect_get_loop_len (loop_vinfo, NULL, lens, 1,
+                                   NULL_TREE, 0, 0);
+    }
+
+  tree iter_var;
+  if (POINTER_TYPE_P (ty_var))
+    {
+      tree offset = gimple_convert (&stmts, sizetype, tree_vf);
+      iter_var = gimple_build (&stmts, POINTER_PLUS_EXPR, ty_var, induc_def,
+                               gimple_convert (&stmts, sizetype, offset));
+    }
+  else
+    {
+      tree offset = gimple_convert (&stmts, ty_var, tree_vf);
+      iter_var = gimple_build (&stmts, PLUS_EXPR, ty_var, induc_def, offset);
+    }
+
+  tree init_var = build_zero_cst (ty_var);
+  if (niters_skip)
+    init_var = gimple_build (&init_stmts, MINUS_EXPR, ty_var, init_var,
+                             gimple_convert (&init_stmts, ty_var,
+                                             niters_skip));
+
+  add_phi_arg (induction_phi, iter_var,
+               loop_latch_edge (loop), UNKNOWN_LOCATION);
+  add_phi_arg (induction_phi, init_var,
+               loop_preheader_edge (loop), UNKNOWN_LOCATION);
+
+  /* Find the first insertion point in the BB.  */
+  auto pe = loop_preheader_edge (loop);
+
+  /* If we've done any peeling, calculate the peeling adjustment needed to
+     the final IV.  */
+  if (niters_skip)
+    {
+      induc_def = gimple_build (&iv_stmts, MAX_EXPR, TREE_TYPE (induc_def),
+                                induc_def,
+                                build_zero_cst (TREE_TYPE (induc_def)));
+      auto stmt = gimple_build_assign (phi_var, induc_def);
+      gimple_seq_add_stmt_without_update (&iv_stmts, stmt);
+
+      basic_block exit_bb = NULL;
+      /* Identify the early exit merge block.  I wish we had stored this.  */
+      for (auto e : get_loop_exit_edges (loop))
+        if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
+          {
+            exit_bb = e->dest;
+            break;
+          }
+
+      gcc_assert (exit_bb);
+      auto exit_gsi = gsi_after_labels (exit_bb);
+      gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
+    }
+
+  /* Write the init_stmts in the loop-preheader block.  */
+  auto psi = gsi_last_nondebug_bb (pe->src);
+  gsi_insert_seq_after (&psi, init_stmts, GSI_LAST_NEW_STMT);
+
+  /* Write the adjustments in the header block.  */
+  basic_block bb = loop->header;
+  auto si = gsi_after_labels (bb);
+  gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -11165,7 +11190,10 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* Handle any code motion that we need to for early-break vectorization
      after we've done peeling but just before we start vectorizing.  */
   if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
-    move_early_exit_stmts (loop_vinfo);
+    {
+      vect_update_ivs_after_vectorizer_for_early_breaks (loop_vinfo);
+      move_early_exit_stmts (loop_vinfo);
+    }
 
   /* Remove existing clobber stmts and prefetches.  */
   for (i = 0; i < nbbs; i++)
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 5b0de9291cb0..658ad6dc2579 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -5885,48 +5885,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size,
                                         "SLP build failed.\n");
            }
        }
-
-      /* Find and create slp instances for inductions that have been forced
-        live due to early break.  */
-      edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo));
-      for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo))
-       {
-         vec<stmt_vec_info> stmts;
-         vec<stmt_vec_info> roots = vNULL;
-         vec<tree> remain = vNULL;
-         gphi *phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info));
-         tree def = gimple_phi_arg_def_from_edge (phi, latch_e);
-         stmt_vec_info lc_info = loop_vinfo->lookup_def (def);
-         if (lc_info)
-           {
-             stmts.create (1);
-             stmts.quick_push (vect_stmt_to_vectorize (lc_info));
-             if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
-                                            stmts, roots, remain,
-                                            max_tree_size, &limit,
-                                            bst_map, force_single_lane))
-               return opt_result::failure_at (vect_location,
-                                              "SLP build failed.\n");
-           }
-         /* When the latch def is from a different cycle this can only
-            be a induction.  Build a simple instance for this.
-            ??? We should be able to start discovery from the PHI
-            for all inductions, but then there will be stray
-            non-SLP stmts we choke on as needing non-SLP handling.  */
-         auto_vec<stmt_vec_info, 1> tem;
-         tem.quick_push (stmt_info);
-         if (!bst_map->get (tem))
-           {
-             stmts.create (1);
-             stmts.quick_push (stmt_info);
-             if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
-                                            stmts, roots, remain,
-                                            max_tree_size, &limit,
-                                            bst_map, force_single_lane))
-               return opt_result::failure_at (vect_location,
-                                              "SLP build failed.\n");
-           }
-       }
     }
 
   hash_set<slp_tree> visited_patterns;
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index de28316ddc66..1d7e50afcde1 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -356,7 +356,6 @@ is_simple_and_all_uses_invariant (stmt_vec_info stmt_info,
    - it has uses outside the loop.
    - it has vdefs (it alters memory).
    - control stmts in the loop (except for the exit condition).
-   - it is an induction and we have multiple exits.
 
    CHECKME: what other side effects would the vectorizer allow?  */
 
@@ -418,26 +417,6 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
        }
     }
 
-  /* Check if it's a not live PHI and multiple exits.  In this case
-     there will be a usage later on after peeling which is needed for the
-     alternate exit.
-     ??? Unless the PHI was marked live because of early
-     break, which also needs the latch def live and vectorized.  */
-  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
-      && is_a <gphi *> (stmt)
-      && gimple_bb (stmt) == LOOP_VINFO_LOOP (loop_vinfo)->header
-      && ((! VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info))
-          && ! *live_p)
-         || STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def))
-    {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_NOTE, vect_location,
-                        "vec_stmt_relevant_p: PHI forced live for "
-                        "early break.\n");
-      LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info);
-      *live_p = true;
-    }
-
   if (*live_p && *relevant == vect_unused_in_scope
       && !is_simple_and_all_uses_invariant (stmt_info, loop_vinfo))
     {
@@ -12985,17 +12964,12 @@ can_vectorize_live_stmts (vec_info *vinfo, bool vec_stmt_p,
                          stmt_vector_for_cost *cost_vec)
 {
-  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
   stmt_vec_info slp_stmt_info;
   unsigned int i;
   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (slp_node), i, slp_stmt_info)
     {
       if (slp_stmt_info
-         && (STMT_VINFO_LIVE_P (slp_stmt_info)
-             || (loop_vinfo
-                 && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
-                 && STMT_VINFO_DEF_TYPE (slp_stmt_info)
-                    == vect_induction_def))
+         && STMT_VINFO_LIVE_P (slp_stmt_info)
          && !vectorizable_live_operation (vinfo, slp_stmt_info, slp_node,
                                           slp_node_instance, i, vec_stmt_p,
                                           cost_vec))
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 5d125afa6bc5..0356b129e36f 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1241,6 +1241,10 @@ public:
      happen.  */
   auto_vec<gimple*> early_break_vuses;
 
+  /* The IV adjustment value for inductions that needs to be materialized
+     inside the relevant exit blocks in order to adjust for early break.  */
+  tree early_break_niters_var;
+
   /* Record statements that are needed to be live for early break
      vectorization but may not have an LC PHI node materialized yet in the
      exits.  */
   auto_vec<stmt_vec_info> early_break_live_ivs;
@@ -1308,6 +1312,7 @@ public:
   (L)->early_break_live_ivs
 #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
 #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
+#define LOOP_VINFO_EARLY_BRK_NITERS_VAR(L) (L)->early_break_niters_var
 #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
 #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
 #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
@@ -2716,7 +2721,8 @@ extern tree cse_and_gimplify_to_preheader (loop_vec_info, tree);
 
 /* Nonlinear induction.  */
 extern tree vect_peel_nonlinear_iv_init (gimple_seq*, tree, tree,
-                                        tree, enum vect_induction_op_type);
+                                        tree, enum vect_induction_op_type,
+                                        bool);
 
 /* In tree-vect-slp.cc.  */
 extern void vect_slp_init (void);
