On Mon, 10 Nov 2025, Tamar Christina wrote:
> Consider this simple loop:
>
> long long arr[1024];
> long long *f()
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     if (arr[i] == 42)
>       break;
>   return arr + i;
> }
>
> where today we generate this at -O3:
>
> .L2:
> add v29.4s, v29.4s, v25.4s
> add v28.4s, v28.4s, v26.4s
> cmp x2, x1
> beq .L9
> .L6:
> ldp q30, q31, [x1], 32
> cmeq v30.2d, v30.2d, v27.2d
> cmeq v31.2d, v31.2d, v27.2d
> addhn v31.2s, v31.2d, v30.2d
> fmov x3, d31
> cbz x3, .L2
>
> which is highly inefficient.  This loop has 3 IVs (PR119577): one normal
> scalar one and two vector ones, one counting up and one counting down
> (PR115120), and it has forced unrolling due to an increase in VF caused by
> the mismatch in modes between the IVs and the loop body (PR119860).
>
> This patch fixes all three of these issues and we now generate:
>
> .L2:
> add w2, w2, 2
> cmp w2, 1024
> beq .L13
> .L5:
> ldr q31, [x1]
> add x1, x1, 16
> cmeq v31.2d, v31.2d, v30.2d
> umaxp v31.4s, v31.4s, v31.4s
> fmov x0, d31
> cbz x0, .L2
>
> or with SVE:
>
> .L3:
> add x1, x1, x3
> whilelo p7.d, w1, w2
> b.none .L11
> .L4:
> ld1d z30.d, p7/z, [x0, x1, lsl 3]
> cmpeq p7.d, p7/z, z30.d, z31.d
> ptest p15, p7.b
> b.none .L3
>
> which shows that the new scalar IV is efficiently merged with the loop
> control one by IVopts.
>
> To accomplish this the patch reworks how we handle "forced live" inductions
> with regard to vectorization.
>
> Prior to this change, when we vectorized a loop with an early break, any
> induction variables would be forced live.  Forcing live means that even
> though the values aren't used inside the loop, we must preserve them such
> that when we start the scalar loop we can pass in the correct initial
> values.
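>
> As a rough, hand-written sketch (not actual vectorizer output; "VF" and
> "any_lane_equals" are placeholders), this is the shape of the problem for
> the loop above:
>
>   /* Vector loop: processes VF elements per iteration.  */
>   for (iv = 0; iv + VF <= 1024; iv += VF)
>     if (any_lane_equals (&arr[iv], 42))   /* early break */
>       break;
>
>   /* The scalar continuation must resume at the right element, so the
>      value of "i" corresponding to "iv" has to be recoverable after the
>      vector loop, even though nothing inside the vector loop uses it.  */
>   for (i = iv; i < 1024; i++)
>     if (arr[i] == 42)
>       break;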
>
> However this had several side-effects:
>
> 1. We must be able to vectorize the induction.
> 2. The induction variable participates in VF determination.  This would
>    often lead to a higher VF than would normally have been needed, making
>    the vector loops less profitable.
> 3. IVcanon on constant loop iteration counts inserts a downward counting IV
>    in addition to the upward one in order to support things like doloops.
>    Normally this duplicate IV is removed by IVopts, but IVopts doesn't
>    understand vector inductions.  As such we end up with 3 IVs, as sketched
>    below.
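>
> A scalar sketch of point 3 (hand-written, not compiler output):
>
>   /* IVcanon adds a down-counting IV next to the original up-counting one;
>      forcing the inductions live for the early break is what then produces
>      the extra vector IV adds visible in the codegen above.  */
>   for (i = 0, down = 1024; down != 0; i++, down--)
>     if (arr[i] == 42)
>       break;
>   /* i is still needed after the loop: return arr + i.  */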
>
> This patch fixes all three of these by instead creating a new scalar IV
> that's adjusted within the loop and updating all the IV statements outside
> the loop using this new IV.
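>
> Roughly (sketch only, not the real GIMPLE; "VF" stands for the per-iteration
> step, which on SELECT_VL targets is the dynamically chosen length):
>
>   scal_iv = 0;
>   do
>     {
>       if (any_lane_equals (&arr[scal_iv], 42))   /* early exit */
>         break;
>       scal_iv += VF;
>     }
>   while (scal_iv < 1024);
>
>   /* In the exit blocks every induction the scalar loop needs is then
>      recomputed from scal_iv, e.g. i = init_i + step_i * scal_iv, instead
>      of being kept live as a vector induction inside the loop.  */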
>
> We re-use vect_update_ivs_after_vectorizer for all exits now and put in a
> dummy value representing the IV that is to be generated later.
>
> This new scalar IV is then materialized in
> vect_update_ivs_after_vectorizer_for_early_breaks.  When peeling for
> alignment (PFA) using masks by skipping iterations, we now roll the PFA IV
> into the new scalar IV by adjusting the first iteration back to
> start - niters_peel and then taking MAX <scal_iv, 0> to correctly handle
> the first iteration.
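>
> I.e. (sketch with made-up names, matching what the code below generates):
>
>   /* Pre-header: with mask-based peeling the first vector iteration only
>      has VF - niters_peel active lanes, so bias the counter.  */
>   scal_iv = 0 - niters_peel;
>   ...
>   /* Early-exit merge block: clamp so a hit inside the partially inactive
>      first iteration doesn't yield a negative iteration count.  */
>   niters = MAX (scal_iv, 0);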
>
> Bootstrapped and regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu -m32, -m64, with some
> issues:
>
> The first issue is that there's a latent bug exposed by this patch, in that
> this example
>
>   integer (8) b, c
>   integer d
>   c = 10
>   d = 2
>   call e ((/ (b, b = j, c, d) /), 0_8, c, d + 0_8)
> contains
>   subroutine e (a, f, g, h)
>     integer (8), dimension (:) :: a
>     integer (8) f, g, h
>     i = 1
>     do b = f, g, h
>       if (a (i) .ne. b) STOP
>       i = i + 1
>     end do
>     if (size (a) .ne. i - 1) STOP 2
>   end
> end
>
> has an execution failure because the wrong value is put on the split exit
> edge of the main exit (this is the same as the test array_constructor_12.f90
> in the testsuite).
>
> vect_update_ivs_after_vectorizer calculates that the loop iterated 7 times,
> while it would have done 6, and when we create the edge that skips the
> scalar loop in
>
>   /* If we have a peeled vector iteration we will never skip the epilog loop
>      and we can simplify the cfg a lot by not doing the edge split.  */
>   if (skip_epilog
>       || (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
>           && !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo)))
>
> we don't copy the value from the scalar loop guard that
> vect_update_ivs_after_vectorizer calculated, so we further propagate the
> issue.
>
> On x86_64 I also seem to have a miscompile of something during stage2,
> which causes the vectorizer to create invalid permutations when vectorizing
> the vect-simd-11.c and vect-simd-15.c cases.
>
> I'm not entirely sure whether the two are related, since I can't see
> anything else specifically wrong in this patch.
>
> So I'm sending this up since Richi wanted to take a look at the first bug
> above and to see if this approach is now in line with what was desired.
>
> Because we are now re-using vect_update_ivs_after_vectorizer we have an
> issue with UB clamping on non-linear inductions.
>
> At the moment when doing early exit updating I just ignore the possibility
> of UB, since if the main exit is OK, the early exit is one iteration behind
> the main one and so should be OK.
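>
> For reference, the clamp in question looks roughly like this (a sketch of
> the existing shr handling in vect_peel_nonlinear_iv_init):
>
>   shift = step * skip_niters;   /* total shift for the skipped iterations */
>   if (shift >= prec)            /* applying it directly would be UB */
>     init = TYPE_UNSIGNED (type) ? 0 : init >> (prec - 1);
>   else
>     init = init >> shift;
>
> With an early break skip_niters is no longer a compile-time constant, so
> that check is bypassed there and we rely on the reasoning above.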
>
> Things however get complicated with PEELED loops.
OK.
I'll have a look at the latent issue. I'll note for a future TODO
that we'd like to have a way to represent a transitionally
"VARYING" SSA name definition so we can get rid of the GENERIC
building and re-gimplification. I'll note that gimple_build stops
at stmts in the IL, so inserting some random definition would have
worked as well (but is wrong in a similar way of course).
Thanks,
Richard.
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR tree-optimization/115120
> PR tree-optimization/119577
> PR tree-optimization/119860
> * tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for nonlinear
> mult induction and early break.
> (vect_update_ivs_after_vectorizer): Support early break exits.
> (vect_do_peeling): Support scalar IVs.
> * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support early break.
> (vect_update_nonlinear_iv): Use unsigned_type_for such that the function
> works for both vector and scalar types.
> (vectorizable_induction, vectorizable_live_operation): Remove vector
> early break IV code.
> (vect_update_ivs_after_vectorizer_for_early_breaks): New.
> (vect_transform_loop): Support new scalar IV for early break.
> * tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for early break
> IVs.
> * tree-vect-stmts.cc (vect_stmt_relevant_p): Mark early break IVs as
> completely unused for vectorization rather than used_only_live.  They no
> longer contribute to the vector loop and so should not be analyzed.
> (can_vectorize_live_stmts): Remove vector early break IV code.
> * tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
> (class loop_vec_info): Add early_break_niters_var.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/115120
> PR tree-optimization/119577
> PR tree-optimization/119860
> * gcc.dg/vect/vect-early-break_39.c: Update.
> * gcc.target/aarch64/sve/peel_ind_10.c: Update.
> * gcc.target/aarch64/sve/peel_ind_11.c: Update.
> * gcc.target/aarch64/sve/peel_ind_12.c: Update.
> * gcc.target/aarch64/sve/peel_ind_5.c: Update.
> * gcc.target/aarch64/sve/peel_ind_6.c: Update.
> * gcc.target/aarch64/sve/peel_ind_7.c: Update.
> * gcc.target/aarch64/sve/peel_ind_9.c: Update.
>
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> index
> b3f40b8c9ba49e41bd283e46a462238c3b5825ef..bc862ad20e68db8f3c0ba6facf47e13a56a7cd6d
> 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> @@ -23,5 +23,6 @@ unsigned test4(unsigned x, unsigned n)
> return ret;
> }
>
> -/* cannot safely vectorize this due due to the group misalignment. */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0
> "vect" } } */
> +/* AArch64 will scalarize the load and is able to vectorize it. */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1
> "vect" { target aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0
> "vect" { target { ! aarch64*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> index
> b7a7bc5cb0cfdfdb74adb120c54ba15019832cf1..43abd01c078da7d3f80045ecbd37b72ac918f678
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> @@ -20,5 +20,4 @@ foo (int start)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> index
> feb7ee7d61c92145e8defc095f2ad096b1e3f777..37806adea7b9788d3122fa32148a8709d5cf57be
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> @@ -15,6 +15,5 @@ foo (int *a) {
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> index
> 260482a94df750b7886d72eed1964e70288c0886..e3ed63afb05cbef15d3c58a18acb0f3650161223
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> @@ -15,7 +15,6 @@ foo (int *restrict a, int * restrict b) {
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Both peeling and versioning will be applied"
> "vect" } } */
> /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
> /* { dg-final { scan-assembler {\teor\t.*\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> index
> a03bb1dec21ef75aa0cbfb22c8bb02b99644239e..1977bf3af2db247825900c4200676f4dc2ca4f9a
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> @@ -20,5 +20,4 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> index
> 9bfd1a65c4feb0c140d4abf98508fc8af08042ba..0b40d26ae2a3f3c882a7e571140f9efabcf9c41a
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> @@ -20,5 +20,4 @@ foo (int start)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> index
> 0182e131a173b7b05e88c3393ba854b2da25c6b2..7a24d689e95a65aa65e1ec6558d117d19407a2c6
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> @@ -20,5 +20,4 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> index
> cc904e88170f072e1d3c6be86643d99a7cd5cb12..136d18c2ea89f5a93a1edfc24fe8b7f97bae82d8
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> @@ -20,6 +20,6 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* Peels using a scalar loop. */
> -/* { dg-final { scan-tree-dump-not "pfa_iv_offset" "vect" } } */
> +/* Peels using fully masked loop. */
> +/* { dg-final { scan-tree-dump "misalignment for fully-masked loop" "vect" }
> } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> index
> 85aab355f95f83e1fa65d280f14fb8ade7f7e658..15dd2d8f45fabcd9b0ec4ef5f4dc83b2b0692822
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> @@ -34,6 +34,5 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index
> 6af07efe68ac5e52e76d0c96f94886774cd1dd40..f645d446e3d2a8db4f2583391c7e6016bf665d1e
> 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -2159,6 +2159,16 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
> return false;
> }
>
> + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> + && induction_type == vect_step_op_mul)
> + {
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "Peeling for is not supported for nonlinear mult"
> + " induction using partial vectorization.\n");
> + return false;
> + }
> +
> /* Avoid compile time hog on vect_peel_nonlinear_iv_init. */
> if (induction_type == vect_step_op_mul)
> {
> @@ -2313,6 +2323,9 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
> The phi args associated with the edge UPDATE_E in the bb
> UPDATE_E->dest are updated accordingly.
>
> + - EARLY_EXIT_P - Indicates whether the exit is an early exit rather than
> + the main latch exit.
> +
> Assumption 1: Like the rest of the vectorizer, this function assumes
> a single loop exit that has a single predecessor.
>
> @@ -2331,7 +2344,8 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
>
> static void
> vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
> - tree niters, edge update_e)
> + tree niters, edge update_e,
> + bool early_exit_p)
> {
> gphi_iterator gsi, gsi1;
> class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> @@ -2398,7 +2412,7 @@ vect_update_ivs_after_vectorizer (loop_vec_info
> loop_vinfo,
> else
> ni = vect_peel_nonlinear_iv_init (&stmts, init_expr,
> niters, step_expr,
> - induction_type);
> + induction_type, early_exit_p);
>
> var = create_tmp_var (type, "tmp");
>
> @@ -3568,9 +3582,36 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
> and so the main exit needs to be treated the same as the alternative
> exits in that we leave their updates to vectorizable_live_operations.
> */
> - if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> - vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> - update_e);
> + tree vector_iters_vf = niters_vector_mult_vf;
> + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> + {
> + tree scal_iv_ty = signed_type_for (TREE_TYPE (vector_iters_vf));
> + tree tmp_niters_vf = make_ssa_name (scal_iv_ty);
> + basic_block exit_bb = NULL;
> + edge update_e = NULL;
> +
> + /* Identify the early exit merge block. I wish we had stored this.
> */
> + for (auto e : get_loop_exit_edges (loop))
> + if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
> + {
> + exit_bb = e->dest;
> + update_e = single_succ_edge (exit_bb);
> + break;
> + }
> + vect_update_ivs_after_vectorizer (loop_vinfo, tmp_niters_vf,
> + update_e, true);
> +
> + if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> + vector_iters_vf = tmp_niters_vf;
> +
> + LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo) = tmp_niters_vf;
> + }
> +
> + bool recalculate_peel_niters_init
> + = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
> + vect_update_ivs_after_vectorizer (loop_vinfo, vector_iters_vf,
> + update_e,
> + recalculate_peel_niters_init);
>
> /* If we have a peeled vector iteration we will never skip the epilog
> loop
> and we can simplify the cfg a lot by not doing the edge split. */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index
> 50cdc2a90fa29c1e0d116c0589bc246e6d8fcc84..91a1833011fc867c21d4018a90f27d1bf9283ca0
> 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -8934,14 +8934,25 @@ vect_create_nonlinear_iv_init (gimple_seq* stmts,
> tree init_expr,
> tree
> vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
> tree skip_niters, tree step_expr,
> - enum vect_induction_op_type induction_type)
> + enum vect_induction_op_type induction_type,
> + bool early_exit_p)
> {
> - gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
> + gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST || early_exit_p);
> tree type = TREE_TYPE (init_expr);
> unsigned prec = TYPE_PRECISION (type);
> switch (induction_type)
> {
> + /* neg inductions are typically not used for loop termination conditions
> but
> + are typically implemented as b = -b. That is every scalar iteration
> b is
> + negated. That means that for the initial value of b we will have to
> + determine whether the number of skipped iterations is a multiple of 2
> + because every 2 scalar iterations we are back at "b". */
> case vect_step_op_neg:
> + /* For early exits the neg induction will always be the same value at
> the
> + start of the iteration. */
> + if (early_exit_p)
> + break;
> +
> if (TREE_INT_CST_LOW (skip_niters) % 2)
> init_expr = gimple_build (stmts, NEGATE_EXPR, type, init_expr);
> /* else no change. */
> @@ -8949,13 +8960,15 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree
> init_expr,
>
> case vect_step_op_shr:
> case vect_step_op_shl:
> - skip_niters = gimple_convert (stmts, type, skip_niters);
> - step_expr = gimple_build (stmts, MULT_EXPR, type, step_expr,
> skip_niters);
> + skip_niters = fold_build1 (NOP_EXPR, type, skip_niters);
> + step_expr = fold_build1 (NOP_EXPR, type, step_expr);
> + step_expr = fold_build2 (MULT_EXPR, type, step_expr, skip_niters);
> /* When shift mount >= precision, need to avoid UD.
> In the original loop, there's no UD, and according to semantic,
> init_expr should be 0 for lshr, ashl, and >>= (prec - 1) for ashr. */
> - if (!tree_fits_uhwi_p (step_expr)
> + if ((!tree_fits_uhwi_p (step_expr)
> || tree_to_uhwi (step_expr) >= prec)
> + && !early_exit_p)
> {
> if (induction_type == vect_step_op_shl
> || TYPE_UNSIGNED (type))
> @@ -8966,13 +8979,19 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree
> init_expr,
> wide_int_to_tree (type, prec - 1));
> }
> else
> - init_expr = gimple_build (stmts, (induction_type == vect_step_op_shr
> + {
> + init_expr = fold_build2 ((induction_type == vect_step_op_shr
> ? RSHIFT_EXPR : LSHIFT_EXPR),
> - type, init_expr, step_expr);
> + type, init_expr, step_expr);
> + init_expr = force_gimple_operand (init_expr, stmts, false, NULL);
> + }
> break;
>
> case vect_step_op_mul:
> {
> + /* Due to UB we can't support vect_step_op_mul with early break for now.
> + so assert and block. */
> + gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
> tree utype = unsigned_type_for (type);
> init_expr = gimple_convert (stmts, utype, init_expr);
> wide_int skipn = wi::to_wide (skip_niters);
> @@ -9056,9 +9075,7 @@ vect_update_nonlinear_iv (gimple_seq* stmts, tree
> vectype,
> case vect_step_op_mul:
> {
> /* Use unsigned mult to avoid UD integer overflow. */
> - tree uvectype
> - = build_vector_type (unsigned_type_for (TREE_TYPE (vectype)),
> - TYPE_VECTOR_SUBPARTS (vectype));
> + tree uvectype = unsigned_type_for (vectype);
> vec_def = gimple_convert (stmts, uvectype, vec_def);
> vec_step = gimple_convert (stmts, uvectype, vec_step);
> vec_def = gimple_build (stmts, MULT_EXPR, uvectype,
> @@ -9305,7 +9322,7 @@ vectorizable_nonlinear_induction (loop_vec_info
> loop_vinfo,
> to adjust the start value here. */
> if (niters_skip != NULL_TREE)
> init_expr = vect_peel_nonlinear_iv_init (&stmts, init_expr, niters_skip,
> - step_expr, induction_type);
> + step_expr, induction_type, false);
>
> vec_init = vect_create_nonlinear_iv_init (&stmts, init_expr,
> step_expr, nunits, vectype,
> @@ -9686,53 +9703,6 @@ vectorizable_induction (loop_vec_info loop_vinfo,
> LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo));
> peel_mul = gimple_build_vector_from_val (&init_stmts,
> step_vectype, peel_mul);
> -
> - /* If early break then we have to create a new PHI which we can use as
> - an offset to adjust the induction reduction in early exits.
> -
> - This is because when peeling for alignment using masking, the first
> - few elements of the vector can be inactive. As such if we find the
> - entry in the first iteration we have adjust the starting point of
> - the scalar code.
> -
> - We do this by creating a new scalar PHI that keeps track of whether
> - we are the first iteration of the loop (with the additional masking)
> - or whether we have taken a loop iteration already.
> -
> - The generated sequence:
> -
> - pre-header:
> - bb1:
> - i_1 = <number of leading inactive elements>
> -
> - header:
> - bb2:
> - i_2 = PHI <i_1(bb1), 0(latch)>
> - …
> -
> - early-exit:
> - bb3:
> - i_3 = iv_step * i_2 + PHI<vector-iv>
> -
> - The first part of the adjustment to create i_1 and i_2 are done here
> - and the last part creating i_3 is done in
> - vectorizable_live_operations when the induction extraction is
> - materialized. */
> - if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> - && !LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> - {
> - auto skip_niters = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> - tree ty_skip_niters = TREE_TYPE (skip_niters);
> - tree break_lhs_phi = vect_get_new_vect_var (ty_skip_niters,
> - vect_scalar_var,
> - "pfa_iv_offset");
> - gphi *nphi = create_phi_node (break_lhs_phi, bb);
> - add_phi_arg (nphi, skip_niters, pe, UNKNOWN_LOCATION);
> - add_phi_arg (nphi, build_zero_cst (ty_skip_niters),
> - loop_latch_edge (iv_loop), UNKNOWN_LOCATION);
> -
> - LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo) = PHI_RESULT (nphi);
> - }
> }
> tree step_mul = NULL_TREE;
> unsigned ivn;
> @@ -10308,8 +10278,7 @@ vectorizable_live_operation (vec_info *vinfo,
> stmt_vec_info stmt_info,
> to the latch then we're restarting the iteration in the
> scalar loop. So get the first live value. */
> bool early_break_first_element_p
> - = (all_exits_as_early_p || !main_exit_edge)
> - && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def;
> + = all_exits_as_early_p || !main_exit_edge;
> if (early_break_first_element_p)
> {
> tmp_vec_lhs = vec_lhs0;
> @@ -10318,52 +10287,13 @@ vectorizable_live_operation (vec_info *vinfo,
> stmt_vec_info stmt_info,
>
> gimple_stmt_iterator exit_gsi;
> tree new_tree
> - = vectorizable_live_operation_1 (loop_vinfo,
> - e->dest, vectype,
> - slp_node, bitsize,
> - tmp_bitstart, tmp_vec_lhs,
> - lhs_type, &exit_gsi);
> + = vectorizable_live_operation_1 (loop_vinfo,
> + e->dest, vectype,
> + slp_node, bitsize,
> + tmp_bitstart, tmp_vec_lhs,
> + lhs_type, &exit_gsi);
>
> auto gsi = gsi_for_stmt (use_stmt);
> - if (early_break_first_element_p
> - && LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> - {
> - tree step_expr
> - = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
> - tree break_lhs_phi
> - = LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo);
> - tree ty_skip_niters = TREE_TYPE (break_lhs_phi);
> - gimple_seq iv_stmts = NULL;
> -
> - /* Now create the PHI for the outside loop usage to
> - retrieve the value for the offset counter. */
> - tree rphi_step
> - = gimple_convert (&iv_stmts, ty_skip_niters, step_expr);
> - tree tmp2
> - = gimple_build (&iv_stmts, MULT_EXPR,
> - ty_skip_niters, rphi_step,
> - break_lhs_phi);
> -
> - if (POINTER_TYPE_P (TREE_TYPE (new_tree)))
> - {
> - tmp2 = gimple_convert (&iv_stmts, sizetype, tmp2);
> - tmp2 = gimple_build (&iv_stmts, POINTER_PLUS_EXPR,
> - TREE_TYPE (new_tree), new_tree,
> - tmp2);
> - }
> - else
> - {
> - tmp2 = gimple_convert (&iv_stmts, TREE_TYPE (new_tree),
> - tmp2);
> - tmp2 = gimple_build (&iv_stmts, PLUS_EXPR,
> - TREE_TYPE (new_tree), new_tree,
> - tmp2);
> - }
> -
> - new_tree = tmp2;
> - gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> - }
> -
> tree lhs_phi = gimple_phi_result (use_stmt);
> remove_phi_node (&gsi, false);
> gimple *copy = gimple_build_assign (lhs_phi, new_tree);
> @@ -11004,6 +10934,105 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
> SET_PHI_ARG_DEF_ON_EDGE (phi, e, last_seen_vuse);
> }
>
> +/* Generate adjustment code for early break scalar IVs filling in the value
> + we created earlier on for LOOP_VINFO_EARLY_BRK_NITERS_VAR. */
> +
> +static void
> +vect_update_ivs_after_vectorizer_for_early_breaks (loop_vec_info loop_vinfo)
> +{
> + DUMP_VECT_SCOPE ("vect_update_ivs_after_vectorizer_for_early_breaks");
> +
> + if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> + return;
> +
> + gcc_assert (LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo));
> +
> + tree phi_var = LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo);
> + tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> + tree ty_var = TREE_TYPE (phi_var);
> + auto loop = LOOP_VINFO_LOOP (loop_vinfo);
> + tree induc_var = niters_skip ? copy_ssa_name (phi_var) : phi_var;
> +
> + auto induction_phi = create_phi_node (induc_var, loop->header);
> + tree induc_def = PHI_RESULT (induction_phi);
> +
> + /* Create the iv update inside the loop. */
> + gimple_seq init_stmts = NULL;
> + gimple_seq stmts = NULL;
> + gimple_seq iv_stmts = NULL;
> + tree tree_vf = build_int_cst (ty_var, vf);
> +
> + /* For loop len targets we have to use .SELECT_VL (ivtmp_33, VF); instead
> of
> + just += VF as the VF can change in between two loop iterations. */
> + if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
> + {
> + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> + tree_vf = vect_get_loop_len (loop_vinfo, NULL, lens, 1,
> + NULL_TREE, 0, 0);
> + }
> +
> + tree iter_var;
> + if (SCALAR_FLOAT_TYPE_P (ty_var))
> + tree_vf = gimple_convert (&init_stmts, ty_var, tree_vf);
> +
> + if (POINTER_TYPE_P (ty_var))
> + {
> + tree offset = gimple_convert (&stmts, sizetype, tree_vf);
> + iter_var = gimple_build (&stmts, POINTER_PLUS_EXPR, ty_var, induc_def,
> + gimple_convert (&stmts, sizetype, offset));
> + }
> + else
> + {
> + tree offset = gimple_convert (&stmts, ty_var, tree_vf);
> + iter_var = gimple_build (&stmts, PLUS_EXPR, ty_var, induc_def, offset);
> + }
> +
> + tree init_var = build_zero_cst (ty_var);
> + if (niters_skip)
> + init_var = gimple_build (&init_stmts, MINUS_EXPR, ty_var, init_var,
> + gimple_convert (&init_stmts, ty_var, niters_skip));
> +
> + add_phi_arg (induction_phi, iter_var,
> + loop_latch_edge (loop), UNKNOWN_LOCATION);
> + add_phi_arg (induction_phi, init_var,
> + loop_preheader_edge (loop), UNKNOWN_LOCATION);
> +
> + /* Find the first insertion point in the BB. */
> + auto pe = loop_preheader_edge (loop);
> +
> + /* If we've done any peeling, calculate the peeling adjustment needed to
> the
> + final IV. */
> + if (niters_skip)
> + {
> + induc_def = gimple_build (&iv_stmts, MAX_EXPR, TREE_TYPE (induc_def),
> + induc_def,
> + build_zero_cst (TREE_TYPE (induc_def)));
> + auto stmt = gimple_build_assign (phi_var, induc_def);
> + gimple_seq_add_stmt_without_update (&iv_stmts, stmt);
> + basic_block exit_bb = NULL;
> + /* Identify the early exit merge block. I wish we had stored this. */
> + for (auto e : get_loop_exit_edges (loop))
> + if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
> + {
> + exit_bb = e->dest;
> + break;
> + }
> +
> + gcc_assert (exit_bb);
> + auto exit_gsi = gsi_after_labels (exit_bb);
> + gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> + }
> +
> + /* Write the init_stmts in the loop-preheader block. */
> + auto psi = gsi_last_nondebug_bb (pe->src);
> + gsi_insert_seq_after (&psi, init_stmts, GSI_LAST_NEW_STMT);
> + /* Write the adjustments in the header block. */
> + basic_block bb = loop->header;
> + auto si = gsi_after_labels (bb);
> + gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
> +}
> +
> /* Function vect_transform_loop.
>
> The analysis phase has determined that the loop is vectorizable.
> @@ -11148,7 +11177,10 @@ vect_transform_loop (loop_vec_info loop_vinfo,
> gimple *loop_vectorized_call)
> /* Handle any code motion that we need to for early-break vectorization
> after
> we've done peeling but just before we start vectorizing. */
> if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> - move_early_exit_stmts (loop_vinfo);
> + {
> + vect_update_ivs_after_vectorizer_for_early_breaks (loop_vinfo);
> + move_early_exit_stmts (loop_vinfo);
> + }
>
> /* Remove existing clobber stmts and prefetches. */
> for (i = 0; i < nbbs; i++)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index
> aa6c3e2e041da26d2e5314b9edd0c5b934a9dd1d..c6b8e93c335616e4cab295a56ead5ca99bca8e82
> 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -5733,48 +5733,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned
> max_tree_size,
> "SLP build failed.\n");
> }
> }
> -
> - /* Find and create slp instances for inductions that have been forced
> - live due to early break. */
> - edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo));
> - for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo))
> - {
> - vec<stmt_vec_info> stmts;
> - vec<stmt_vec_info> roots = vNULL;
> - vec<tree> remain = vNULL;
> - gphi *phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info));
> - tree def = gimple_phi_arg_def_from_edge (phi, latch_e);
> - stmt_vec_info lc_info = loop_vinfo->lookup_def (def);
> - if (lc_info)
> - {
> - stmts.create (1);
> - stmts.quick_push (vect_stmt_to_vectorize (lc_info));
> - if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> - stmts, roots, remain,
> - max_tree_size, &limit,
> - bst_map, force_single_lane))
> - return opt_result::failure_at (vect_location,
> - "SLP build failed.\n");
> - }
> - /* When the latch def is from a different cycle this can only
> - be a induction. Build a simple instance for this.
> - ??? We should be able to start discovery from the PHI
> - for all inductions, but then there will be stray
> - non-SLP stmts we choke on as needing non-SLP handling. */
> - auto_vec<stmt_vec_info, 1> tem;
> - tem.quick_push (stmt_info);
> - if (!bst_map->get (tem))
> - {
> - stmts.create (1);
> - stmts.quick_push (stmt_info);
> - if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> - stmts, roots, remain,
> - max_tree_size, &limit,
> - bst_map, force_single_lane))
> - return opt_result::failure_at (vect_location,
> - "SLP build failed.\n");
> - }
> - }
> }
>
> hash_set<slp_tree> visited_patterns;
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index
> 83acbb3ff67ccdd4a39606850a23f483d6a4b1fb..eb46f1e86575dc72cead4b214af02a31c0d336ff
> 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -356,7 +356,6 @@ is_simple_and_all_uses_invariant (stmt_vec_info stmt_info,
> - it has uses outside the loop.
> - it has vdefs (it alters memory).
> - control stmts in the loop (except for the exit condition).
> - - it is an induction and we have multiple exits.
>
> CHECKME: what other side effects would the vectorizer allow? */
>
> @@ -420,9 +419,8 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
>
> /* Check if it's a not live PHI and multiple exits. In this case
> there will be a usage later on after peeling which is needed for the
> - alternate exit.
> - ??? Unless the PHI was marked live because of early
> - break, which also needs the latch def live and vectorized. */
> + alternate exit. Explicitly force it to be live but irrelevant for
> + vectorization, otherwise we will still analyze it for VF usage. */
> if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> && is_a <gphi *> (stmt)
> && gimple_bb (stmt) == LOOP_VINFO_LOOP (loop_vinfo)->header
> @@ -432,10 +430,10 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
> {
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> - "vec_stmt_relevant_p: PHI forced live for "
> - "early break.\n");
> - LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info);
> + "vec_stmt_relevant_p: PHI live but not relevant due to"
> + " early break.\n");
> *live_p = true;
> + return true;
> }
>
> if (*live_p && *relevant == vect_unused_in_scope
> @@ -12750,17 +12748,12 @@ can_vectorize_live_stmts (vec_info *vinfo,
> bool vec_stmt_p,
> stmt_vector_for_cost *cost_vec)
> {
> - loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
> stmt_vec_info slp_stmt_info;
> unsigned int i;
> FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (slp_node), i, slp_stmt_info)
> {
> if (slp_stmt_info
> - && (STMT_VINFO_LIVE_P (slp_stmt_info)
> - || (loop_vinfo
> - && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> - && STMT_VINFO_DEF_TYPE (slp_stmt_info)
> - == vect_induction_def))
> + && STMT_VINFO_LIVE_P (slp_stmt_info)
> && !vectorizable_live_operation (vinfo, slp_stmt_info, slp_node,
> slp_node_instance, i,
> vec_stmt_p, cost_vec))
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index
> 359c994139b211d3b4ac54d08890465f04ef107a..5362da5f2b40f556dd7d81b99a3b7b763f4ab296
> 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1233,6 +1233,10 @@ public:
> happen. */
> auto_vec<gimple*> early_break_vuses;
>
> + /* The IV adjustment value for inductions that needs to be materialized
> + inside the relevant exit blocks in order to adjust for early break. */
> + tree early_break_niters_var;
> +
> /* Record statements that are needed to be live for early break
> vectorization
> but may not have an LC PHI node materialized yet in the exits. */
> auto_vec<stmt_vec_info> early_break_live_ivs;
> @@ -1302,6 +1306,7 @@ public:
> (L)->early_break_live_ivs
> #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb
> #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses
> +#define LOOP_VINFO_EARLY_BRK_NITERS_VAR(L) (L)->early_break_niters_var
> #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds
> #define LOOP_VINFO_LOOP_IV_COND(L) (L)->loop_iv_cond
> #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
> @@ -2712,7 +2717,8 @@ extern tree cse_and_gimplify_to_preheader
> (loop_vec_info, tree);
>
> /* Nonlinear induction. */
> extern tree vect_peel_nonlinear_iv_init (gimple_seq*, tree, tree,
> - tree, enum vect_induction_op_type);
> + tree, enum vect_induction_op_type,
> + bool);
>
> /* In tree-vect-slp.cc. */
> extern void vect_slp_init (void);
>
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)