On Mon, 10 Nov 2025, Tamar Christina wrote:
> Consider this simple loop:
>
> long long arr[1024];
> long long *f()
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     if (arr[i] == 42)
>       break;
>   return arr + i;
> }
>
> where today we generate this at -O3:
>
> .L2:
> add v29.4s, v29.4s, v25.4s
> add v28.4s, v28.4s, v26.4s
> cmp x2, x1
> beq .L9
> .L6:
> ldp q30, q31, [x1], 32
> cmeq v30.2d, v30.2d, v27.2d
> cmeq v31.2d, v31.2d, v27.2d
> addhn v31.2s, v31.2d, v30.2d
> fmov x3, d31
> cbz x3, .L2
>
> which is highly inefficient.  This loop has 3 IVs (PR119577): one normal
> scalar one and two vector ones, one counting up and one counting down
> (PR115120), and it has forced unrolling due to an increase in VF caused by
> the mismatch in modes between the IVs and the loop body (PR119860).
>
> This patch fixes all three of these issues and we now generate:
>
> .L2:
> add w2, w2, 2
> cmp w2, 1024
> beq .L13
> .L5:
> ldr q31, [x1]
> add x1, x1, 16
> cmeq v31.2d, v31.2d, v30.2d
> umaxp v31.4s, v31.4s, v31.4s
> fmov x0, d31
> cbz x0, .L2
>
> or with SVE:
>
> .L3:
> add x1, x1, x3
> whilelo p7.d, w1, w2
> b.none .L11
> .L4:
> ld1d z30.d, p7/z, [x0, x1, lsl 3]
> cmpeq p7.d, p7/z, z30.d, z31.d
> ptest p15, p7.b
> b.none .L3
>
> which shows that the new scalar IV is efficiently merged with the loop
> control one by IVopts.
>
> To accomplish this the patch reworks how we handle "forced live" inductions
> with regard to vectorization.
>
> Prior to this change, when we vectorized a loop with an early break, any
> induction variables would be forced live.  Forcing live means that even
> though the values aren't used inside the loop, we must preserve them such
> that when we start the scalar loop we can pass in the correct initial
> values.
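>
> As a rough, hand-written sketch (not actual vectorizer output; "VF" and
> "any_lane_equals" are placeholders), this is the shape of the problem for
> the loop above:
>
>   /* Vector loop: processes VF elements per iteration.  */
>   for (iv = 0; iv + VF <= 1024; iv += VF)
>     if (any_lane_equals (&arr[iv], 42))   /* early break */
>       break;
>
>   /* The scalar continuation must resume at the right element, so the
>      value of "i" corresponding to "iv" has to be recoverable after the
>      vector loop, even though nothing inside the vector loop uses it.  */
>   for (i = iv; i < 1024; i++)
>     if (arr[i] == 42)
>       break;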
>
> However this had several side-effects:
>
> 1. We must be able to vectorize the induction.
> 2. The induction variable participates in VF determination.  This would
>    often lead to a higher VF than would normally have been needed, making
>    the vector loops less profitable.
> 3. IVcanon on constant loop iteration counts inserts a downward counting IV
>    in addition to the upward one in order to support things like doloops.
>    Normally this duplicate IV is removed by IVopts, but IVopts doesn't
>    understand vector inductions.  As such we end up with 3 IVs, as sketched
>    below.
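>
> A scalar sketch of point 3 (hand-written, not compiler output):
>
>   /* IVcanon adds a down-counting IV next to the original up-counting one;
>      forcing the inductions live for the early break is what then produces
>      the extra vector IV adds visible in the codegen above.  */
>   for (i = 0, down = 1024; down != 0; i++, down--)
>     if (arr[i] == 42)
>       break;
>   /* i is still needed after the loop: return arr + i.  */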
>
> This patch fixes all three of these by instead creating a new scalar IV
> that's adjusted within the loop and updating all the IV statements outside
> the loop using this new IV.
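>
> Roughly (sketch only, not the real GIMPLE; "VF" stands for the per-iteration
> step, which on SELECT_VL targets is the dynamically chosen length):
>
>   scal_iv = 0;
>   do
>     {
>       if (any_lane_equals (&arr[scal_iv], 42))   /* early exit */
>         break;
>       scal_iv += VF;
>     }
>   while (scal_iv < 1024);
>
>   /* In the exit blocks every induction the scalar loop needs is then
>      recomputed from scal_iv, e.g. i = init_i + step_i * scal_iv, instead
>      of being kept live as a vector induction inside the loop.  */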
>
> We re-use vect_update_ivs_after_vectorizer for all exits now and put in a
> dummy value representing the IV that is to be generated later.
>
> This new scalar IV is then materialized in
> vect_update_ivs_after_vectorizer_for_early_breaks.  When peeling for
> alignment (PFA) using masks by skipping iterations, we now roll the PFA IV
> into the new scalar IV by adjusting the first iteration back to
> start - niters_peel and then taking MAX <scal_iv, 0> to correctly handle
> the first iteration.
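>
> I.e. (sketch with made-up names, matching what the code below generates):
>
>   /* Pre-header: with mask-based peeling the first vector iteration only
>      has VF - niters_peel active lanes, so bias the counter.  */
>   scal_iv = 0 - niters_peel;
>   ...
>   /* Early-exit merge block: clamp so a hit inside the partially inactive
>      first iteration doesn't yield a negative iteration count.  */
>   niters = MAX (scal_iv, 0);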
>
> Bootstrapped and regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu -m32, -m64, with some
> issues:
>
> The first issue is that there's a latent bug exposed by this patch, in that
> this example
>
>   integer (8) b, c
>   integer d
>   c = 10
>   d = 2
>   call e ((/ (b, b = j, c, d) /), 0_8, c, d + 0_8)
> contains
>   subroutine e (a, f, g, h)
>     integer (8), dimension (:) :: a
>     integer (8) f, g, h
>     i = 1
>     do b = f, g, h
>       if (a (i) .ne. b) STOP
>       i = i + 1
>     end do
>     if (size (a) .ne. i - 1) STOP 2
>   end
> end
>
> has an execution failure because the wrong value is put on the split exit
> edge of the main exit (this is the same as the test array_constructor_12.f90
> in the testsuite).
>
> vect_update_ivs_after_vectorizer calculates that the loop iterated 7 times,
> while it would have done 6, and when we create the edge that skips the
> scalar loop in
>
>   /* If we have a peeled vector iteration we will never skip the epilog loop
>      and we can simplify the cfg a lot by not doing the edge split.  */
>   if (skip_epilog
>       || (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
>           && !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo)))
>
> we don't copy the value from the scalar loop guard that
> vect_update_ivs_after_vectorizer calculated, so we further propagate the
> issue.
>
> On x86_64 I also seem to have a miscompile of something during stage2,
> which causes the vectorizer to create invalid permutations when vectorizing
> the vect-simd-11.c and vect-simd-15.c cases.
>
> I'm not entirely sure whether the two are related, since I can't see
> anything else specifically wrong in this patch.
>
> So I'm sending this up since Richi wanted to take a look at the first bug
> above and to see if this approach is now in line with what was desired.
>
> Because we are now re-using vect_update_ivs_after_vectorizer we have an
> issue with UB clamping on non-linear inductions.
>
> At the moment when doing early exit updating I just ignore the possibility
> of UB, since if the main exit is OK, the early exit is one iteration behind
> the main one and so should be OK.
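>
> For reference, the clamp in question looks roughly like this (a sketch of
> the existing shr handling in vect_peel_nonlinear_iv_init):
>
>   shift = step * skip_niters;   /* total shift for the skipped iterations */
>   if (shift >= prec)            /* applying it directly would be UB */
>     init = TYPE_UNSIGNED (type) ? 0 : init >> (prec - 1);
>   else
>     init = init >> shift;
>
> With an early break skip_niters is no longer a compile-time constant, so
> that check is bypassed there and we rely on the reasoning above.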
>
> Things however get complicated with PEELED loops.
OK.
I'll have a look at the latent issue. I'll note for a future TODO
that we'd like to have a way to represent a transitionally
"VARYING" SSA name definition so we can get rid of the GENERIC
building and re-gimplification. I'll note that gimple_build stops
at stmts in the IL, so inserting some random definition would have
worked as well (but is wrong in a similar way of course).
Thanks,
Richard.
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR tree-optimization/115120
> PR tree-optimization/119577
> PR tree-optimization/119860
> * tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for nonlinear
> mult induction and early break.
> (vect_update_ivs_after_vectorizer): Support early break exits.
> (vect_do_peeling): Support scalar IVs.
> * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support early break.
> (vect_update_nonlinear_iv): Use unsigned_type_for such that the function
> works for both vector and scalar types.
> (vectorizable_induction, vectorizable_live_operation): Remove vector
> early break IV code.
> (vect_update_ivs_after_vectorizer_for_early_breaks): New.
> (vect_transform_loop): Support new scalar IV for early break.
> * tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for early break
> IVs.
> * tree-vect-stmts.cc (vect_stmt_relevant_p): Mark early break IVs as
> completely unused for vectorization rather than used_only_live.  They no
> longer contribute to the vector loop and so should not be analyzed.
> (can_vectorize_live_stmts): Remove vector early break IV code.
> * tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
> (class loop_vec_info): Add early_break_niters_var.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/115120
> PR tree-optimization/119577
> PR tree-optimization/119860
> * gcc.dg/vect/vect-early-break_39.c: Update.
> * gcc.target/aarch64/sve/peel_ind_10.c: Update.
> * gcc.target/aarch64/sve/peel_ind_11.c: Update.
> * gcc.target/aarch64/sve/peel_ind_12.c: Update.
> * gcc.target/aarch64/sve/peel_ind_5.c: Update.
> * gcc.target/aarch64/sve/peel_ind_6.c: Update.
> * gcc.target/aarch64/sve/peel_ind_7.c: Update.
> * gcc.target/aarch64/sve/peel_ind_9.c: Update.
>
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> index
> b3f40b8c9ba49e41bd283e46a462238c3b5825ef..bc862ad20e68db8f3c0ba6facf47e13a56a7cd6d
> 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> @@ -23,5 +23,6 @@ unsigned test4(unsigned x, unsigned n)
> return ret;
> }
>
> -/* cannot safely vectorize this due due to the group misalignment. */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0
> "vect" } } */
> +/* AArch64 will scalarize the load and is able to vectorize it. */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1
> "vect" { target aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0
> "vect" { target { ! aarch64*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> index
> b7a7bc5cb0cfdfdb74adb120c54ba15019832cf1..43abd01c078da7d3f80045ecbd37b72ac918f678
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> @@ -20,5 +20,4 @@ foo (int start)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> index
> feb7ee7d61c92145e8defc095f2ad096b1e3f777..37806adea7b9788d3122fa32148a8709d5cf57be
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> @@ -15,6 +15,5 @@ foo (int *a) {
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> index
> 260482a94df750b7886d72eed1964e70288c0886..e3ed63afb05cbef15d3c58a18acb0f3650161223
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> @@ -15,7 +15,6 @@ foo (int *restrict a, int * restrict b) {
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Both peeling and versioning will be applied"
> "vect" } } */
> /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
> /* { dg-final { scan-assembler {\teor\t.*\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> index
> a03bb1dec21ef75aa0cbfb22c8bb02b99644239e..1977bf3af2db247825900c4200676f4dc2ca4f9a
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> @@ -20,5 +20,4 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> index
> 9bfd1a65c4feb0c140d4abf98508fc8af08042ba..0b40d26ae2a3f3c882a7e571140f9efabcf9c41a
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> @@ -20,5 +20,4 @@ foo (int start)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> index
> 0182e131a173b7b05e88c3393ba854b2da25c6b2..7a24d689e95a65aa65e1ec6558d117d19407a2c6
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> @@ -20,5 +20,4 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> index
> cc904e88170f072e1d3c6be86643d99a7cd5cb12..136d18c2ea89f5a93a1edfc24fe8b7f97bae82d8
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> @@ -20,6 +20,6 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* Peels using a scalar loop. */
> -/* { dg-final { scan-tree-dump-not "pfa_iv_offset" "vect" } } */
> +/* Peels using fully masked loop. */
> +/* { dg-final { scan-tree-dump "misalignment for fully-masked loop" "vect" }
> } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> index
> 85aab355f95f83e1fa65d280f14fb8ade7f7e658..15dd2d8f45fabcd9b0ec4ef5f4dc83b2b0692822
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> @@ -34,6 +34,5 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index
> 6af07efe68ac5e52e76d0c96f94886774cd1dd40..f645d446e3d2a8db4f2583391c7e6016bf665d1e
> 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -2159,6 +2159,16 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
> return false;
> }
>
> + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> + && induction_type == vect_step_op_mul)
> + {
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "Peeling for is not supported for nonlinear mult"
> + " induction using partial vectorization.\n");
> + return false;
> + }
> +
> /* Avoid compile time hog on vect_peel_nonlinear_iv_init. */
> if (induction_type == vect_step_op_mul)
> {
> @@ -2313,6 +2323,9 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
> The phi args associated with the edge UPDATE_E in the bb
> UPDATE_E->dest are updated accordingly.
>
> + - EARLY_EXIT_P - Indicates whether the exit is an early exit rather than
> + the main latch exit.
> +
> Assumption 1: Like the rest of the vectorizer, this function assumes
> a single loop exit that has a single predecessor.
>
> @@ -2331,7 +2344,8 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
>
> static void
> vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
> - tree niters, edge update_e)
> + tree niters, edge update_e,
> + bool early_exit_p)
> {
> gphi_iterator gsi, gsi1;
> class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> @@ -2398,7 +2412,7 @@ vect_update_ivs_after_vectorizer (loop_vec_info
> loop_vinfo,
> else
> ni = vect_peel_nonlinear_iv_init (&stmts, init_expr,
> niters, step_expr,
> - induction_type);
> + induction_type, early_exit_p);
>
> var = create_tmp_var (type, "tmp");
>
> @@ -3568,9 +3582,36 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
> and so the main exit needs to be treated the same as the alternative
> exits in that we leave their updates to vectorizable_live_operations.
> */
> - if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> - vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> - update_e);
> + tree vector_iters_vf = niters_vector_mult_vf;
> + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> + {
> + tree scal_iv_ty = signed_type_for (TREE_TYPE (vector_iters_vf));
> + tree tmp_niters_vf = make_ssa_name (scal_iv_ty);
> + basic_block exit_bb = NULL;
> + edge update_e = NULL;
> +
> + /* Identify the early exit merge block. I wish we had stored this.
> */
> + for (auto e : get_loop_exit_edges (loop))
> + if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
> + {
> + exit_bb = e->dest;
> + update_e = single_succ_edge (exit_bb);
> + break;
> + }
> + vect_update_ivs_after_vectorizer (loop_vinfo, tmp_niters_vf,
> + update_e, true);
> +
> + if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> + vector_iters_vf = tmp_niters_vf;
> +
> + LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo) = tmp_niters_vf;
> + }
> +
> + bool recalculate_peel_niters_init
> + = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
> + vect_update_ivs_after_vectorizer (loop_vinfo, vector_iters_vf,
> + update_e,
> + recalculate_peel_niters_init);
>
> /* If we have a peeled vector iteration we will never skip the epilog
> loop
> and we can simplify the cfg a lot by not doing the edge split. */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index
> 50cdc2a90fa29c1e0d116c0589bc246e6d8fcc84..91a1833011fc867c21d4018a90f27d1bf9283ca0
> 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -8934,14 +8934,25 @@ vect_create_nonlinear_iv_init (gimple_seq* stmts,
> tree init_expr,
> tree
> vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
> tree skip_niters, tree step_expr,
> - enum vect_induction_op_type induction_type)
> + enum vect_induction_op_type induction_type,
> + bool early_exit_p)
> {
> - gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
> + gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST || early_exit_p);
> tree type = TREE_TYPE (init_expr);
> unsigned prec = TYPE_PRECISION (type);
> switch (induction_type)
> {
> + /* neg inductions are typically not used for loop termination conditions
> but
> + are typically implemented as b = -b. That is every scalar iteration
> b is
> + negated. That means that for the initial value of b we will have to
> + determine whether the number of skipped iterations is a multiple of 2
> + because every 2 scalar iterations we are back at "b". */
> case vect_step_op_neg:
> + /* For early exits the neg induction will always be the same value at
> the
> + start of the iteration. */
> + if (early_exit_p)
> + break;
> +
> if (TREE_INT_CST_LOW (skip_niters) % 2)
> init_expr = gimple_build (stmts, NEGATE_EXPR, type, init_expr);
> /* else no change. */
> @@ -8949,13 +8960,15 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree
> init_expr,
>
> case vect_step_op_shr:
> case vect_step_op_shl:
> - skip_niters = gimple_convert (stmts, type, skip_niters);
> - step_expr = gimple_build (stmts, MULT_EXPR, type, step_expr,
> skip_niters);
> + skip_niters = fold_build1 (NOP_EXPR, type, skip_niters);
> + step_expr = fold_build1 (NOP_EXPR, type, step_expr);
> + step_expr = fold_build2 (MULT_EXPR, type, step_expr, skip_niters);
> /* When shift mount >= precision, need to avoid UD.
> In the original loop, there's no UD, and according to semantic,
> init_expr should be 0 for lshr, ashl, and >>= (prec - 1) for ashr. */
> - if (!tree_fits_uhwi_p (step_expr)
> + if ((!tree_fits_uhwi_p (step_expr)
> || tree_to_uhwi (step_expr) >= prec)
> + && !early_exit_p)
> {
> if (induction_type == vect_step_op_shl
> || TYPE_UNSIGNED (type))
> @@ -8966,13 +8979,19 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree
> init_expr,
> wide_int_to_tree (type, prec - 1));
> }
> else
> - init_expr = gimple_build (stmts, (induction_type == vect_step_op_shr
> + {
> + init_expr = fold_build2 ((induction_type == vect_step_op_shr
> ? RSHIFT_EXPR : LSHIFT_EXPR),
> - type, init_expr, step_expr);
> + type, init_expr, step_expr);
> + init_expr = force_gimple_operand (init_expr, stmts, false, NULL);
> + }
> break;
>
> case vect_step_op_mul:
> {
> + /* Due to UB we can't support vect_step_op_mul with early break for now.
> + so assert and block. */
> + gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
> tree utype = unsigned_type_for (type);
> init_expr = gimple_convert (stmts, utype, init_expr);
> wide_int skipn = wi::to_wide (skip_niters);
> @@ -9056,9 +9075,7 @@ vect_update_nonlinear_iv (gimple_seq* stmts, tree
> vectype,
> case vect_step_op_mul:
> {
> /* Use unsigned mult to avoid UD integer overflow. */
> - tree uvectype
> - = build_vector_type (unsigned_type_for (TREE_TYPE (vectype)),
> - TYPE_VECTOR_SUBPARTS (vectype));
> + tree uvectype = unsigned_type_for (vectype);
> vec_def = gimple_convert (stmts, uvectype, vec_def);
> vec_step = gimple_convert (stmts, uvectype, vec_step);
> vec_def = gimple_build (stmts, MULT_EXPR, uvectype,
> @@ -9305,7 +9322,7 @@ vectorizable_nonlinear_induction (loop_vec_info
> loop_vinfo,
> to adjust the start value here. */
> if (niters_skip != NULL_TREE)
> init_expr = vect_peel_nonlinear_iv_init (&stmts, init_expr, niters_skip,
> - step_expr, induction_type);
> + step_expr, induction_type, false);
>
> vec_init = vect_create_nonlinear_iv_init (&stmts, init_expr,
> step_expr, nunits, vectype,
> @@ -9686,53 +9703,6 @@ vectorizable_induction (loop_vec_info loop_vinfo,
> LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo));
> peel_mul = gimple_build_vector_from_val (&init_stmts,
> step_vectype, peel_mul);
> -
> - /* If early break then we have to create a new PHI which we can use as
> - an offset to adjust the induction reduction in early exits.
> -
> - This is because when peeling for alignment using masking, the first
> - few elements of the vector can be inactive. As such if we find the
> - entry in the first iteration we have adjust the starting point of
> - the scalar code.
> -
> - We do this by creating a new scalar PHI that keeps track of whether
> - we are the first iteration of the loop (with the additional masking)
> - or whether we have taken a loop iteration already.
> -
> - The generated sequence:
> -
> - pre-header:
> - bb1:
> - i_1 = <number of leading inactive elements>
> -
> - header:
> - bb2:
> - i_2 = PHI <i_1(bb1), 0(latch)>
> - …
> -
> - early-exit:
> - bb3:
> - i_3 = iv_step * i_2 + PHI<vector-iv>
> -
> - The first part of the adjustment to create i_1 and i_2 are done here
> - and the last part creating i_3 is done in
> - vectorizable_live_operations when the induction extraction is
> - materialized. */
> - if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> - && !LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> - {
> - auto skip_niters = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> - tree ty_skip_niters = TREE_TYPE (skip_niters);
> - tree break_lhs_phi = vect_get_new_vect_var (ty_skip_niters,
> - vect_scalar_var,
> - "pfa_iv_offset");
> - gphi *nphi = create_phi_node (break_lhs_phi, bb);
> - add_phi_arg (nphi, skip_niters, pe, UNKNOWN_LOCATION);
> - add_phi_arg (nphi, build_zero_cst (ty_skip_niters),
> - loop_latch_edge (iv_loop), UNKNOWN_LOCATION);
> -
> - LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo) = PHI_RESULT (nphi);
> - }
> }
> tree step_mul = NULL_TREE;
> unsigned ivn;
> @@ -10308,8 +10278,7 @@ vectorizable_live_operation (vec_info *vinfo,
> stmt_vec_info stmt_info,
> to the latch then we're restarting the iteration in the
> scalar loop. So get the first live value. */
> bool early_break_first_element_p
> - = (all_exits_as_early_p || !main_exit_edge)
> - && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def;
> + = all_exits_as_early_p || !main_exit_edge;
> if (early_break_first_element_p)
> {
> tmp_vec_lhs = vec_lhs0;
> @@ -10318,52 +10287,13 @@ vectorizable_live_operation (vec_info *vinfo,
> stmt_vec_info stmt_info,
>
> gimple_stmt_iterator exit_gsi;
> tree new_tree
> - = vectorizable_live_operation_1 (loop_vinfo,
> - e->dest, vectype,
> - slp_node, bitsize,
> - tmp_bitstart, tmp_vec_lhs,
> - lhs_type, &exit_gsi);
> + = vectorizable_live_operation_1 (loop_vinfo,
> + e->dest, vectype,
> + slp_node, bitsize,
> + tmp_bitstart, tmp_vec_lhs,
> + lhs_type, &exit_gsi);
>
> auto gsi = gsi_for_stmt (use_stmt);
> - if (early_break_first_element_p
> - && LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> - {
> - tree step_expr
> - = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
> - tree break_lhs_phi
> - = LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo);
> - tree ty_skip_niters = TREE_TYPE (break_lhs_phi);
> - gimple_seq iv_stmts = NULL;
> -
> - /* Now create the PHI for the outside loop usage to
> - retrieve the value for the offset counter. */
> - tree rphi_step
> - = gimple_convert (&iv_stmts, ty_skip_niters, step_expr);
> - tree tmp2
> - = gimple_build (&iv_stmts, MULT_EXPR,
> - ty_skip_niters, rphi_step,
> - break_lhs_phi);
> -
> - if (POINTER_TYPE_P (TREE_TYPE (new_tree)))
> - {
> - tmp2 = gimple_convert (&iv_stmts, sizetype, tmp2);
> - tmp2 = gimple_build (&iv_stmts, POINTER_PLUS_EXPR,
> - TREE_TYPE (new_tree), new_tree,
> - tmp2);
> - }
> - else
> - {
> - tmp2 = gimple_convert (&iv_stmts, TREE_TYPE (new_tree),
> - tmp2);
> - tmp2 = gimple_build (&iv_stmts, PLUS_EXPR,
> - TREE_TYPE (new_tree), new_tree,
> - tmp2);
> - }
> -
> - new_tree = tmp2;
> - gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> - }
> -
> tree lhs_phi = gimple_phi_result (use_stmt);
> remove_phi_node (&gsi, false);
> gimple *copy = gimple_build_assign (lhs_phi, new_tree);
> @@ -11004,6 +10934,105 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
> SET_PHI_ARG_DEF_ON_EDGE (phi, e, last_seen_vuse);
> }
>
> +/* Generate adjustment code for early break scalar IVs filling in the value
> + we created earlier on for LOOP_VINFO_EARLY_BRK_NITERS_VAR. */
> +
> +static void
> +vect_update_ivs_after_vectorizer_for_early_breaks (loop_vec_info loop_vinfo)
> +{
> + DUMP_VECT_SCOPE ("vect_update_ivs_after_vectorizer_for_early_breaks");
> +
> + if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> + return;
> +
> + gcc_assert (LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo));
> +
> + tree phi_var = LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo);
> + tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> + tree ty_var = TREE_TYPE (phi_var);
> + auto loop = LOOP_VINFO_LOOP (loop_vinfo);
> + tree induc_var = niters_skip ? copy_ssa_name (phi_var) : phi_var;
> +
> + auto induction_phi = create_phi_node (induc_var, loop->header);
> + tree induc_def = PHI_RESULT (induction_phi);
> +
> + /* Create the iv update inside the loop. */
> + gimple_seq init_stmts = NULL;
> + gimple_seq stmts = NULL;
> + gimple_seq iv_stmts = NULL;
> + tree tree_vf = build_int_cst (ty_var, vf);
> +
> + /* For loop len targets we have to use .SELECT_VL (ivtmp_33, VF); instead
> of
> + just += VF as the VF can change in between two loop iterations. */
> + if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
> + {
> + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> + tree_vf = vect_get_loop_len (loop_vinfo, NULL, lens, 1,
> + NULL_TREE, 0, 0);
> + }
> +
> + tree iter_var;
> + if (SCALAR_FLOAT_TYPE_P (ty_var))
> + tree_vf = gimple_convert (&init_stmts, ty_var, tree_vf);
> +
> + if (POINTER_TYPE_P (ty_var))
> + {
> + tree offset = gimple_convert (&stmts, sizetype, tree_vf);
> + iter_var = gimple_build (&stmts, POINTER_PLUS_EXPR, ty_var, induc_def,
> + gimple_convert (&stmts, sizetype, offset));
> + }
> + else
> + {
> + tree offset = gimple_convert (&stmts, ty_var, tree_vf);
> + iter_var = gimple_build (&stmts, PLUS_EXPR, ty_var, induc_def, offset);
> + }
> +
> + tree init_var = build_zero_cst (ty_var);
> + if (niters_skip)
> + init_var = gimple_build (&init_stmts, MINUS_EXPR, ty_var, init_var,
> + gimple_convert (&init_stmts, ty_var, niters_skip));
> +
> + add_phi_arg (induction_phi, iter_var,
> + loop_latch_edge (loop), UNKNOWN_LOCATION);
> + add_phi_arg (induction_phi, init_var,
> + loop_preheader_edge (loop), UNKNOWN_LOCATION);
> +
> + /* Find the first insertion point in the BB. */
> + auto pe = loop_preheader_edge (loop);
> +
> + /* If we've done any peeling, calculate the peeling adjustment needed to
> the
> + final IV. */
> + if (niters_skip)
> + {
> + induc_def = gimple_build (&iv_stmts, MAX_EXPR, TREE_TYPE (induc_def),
> + induc_def,
> + build_zero_cst (TREE_TYPE (induc_def)));
> + auto stmt = gimple_build_assign (phi_var, induc_def);
> + gimple_seq_add_stmt_without_update (&iv_stmts, stmt);
> + basic_block exit_bb = NULL;
> + /* Identify the early exit merge block. I wish we had stored this. */
> + for (auto e : get_loop_exit_edges (loop))
> + if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
> + {
> + exit_bb = e->dest;
> + break;
> + }
> +
> + gcc_assert (exit_bb);
> + auto exit_gsi = gsi_after_labels (exit_bb);
> + gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> + }
> +
> + /* Write the init_stmts in the loop-preheader block. */
> + auto psi = gsi_last_nondebug_bb (pe->src);
> + gsi_insert_seq_after (&psi, init_stmts, GSI_LAST_NEW_STMT);
> + /* Write the adjustments in the header block. */
> + basic_block bb = loop->header;
> + auto si = gsi_after_labels (bb);
> + gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
> +}
> +
> /* Function vect_transform_loop.
>
> The analysis phase has determined that the loop is vectorizable.
> @@ -11148,7 +11177,10 @@ vect_transform_loop (loop_vec_info loop_vinfo,
> gimple *loop_vectorized_call)
> /* Handle any code motion that we need to for early-break vectorization
> after
> we've done peeling but just before we start vectorizing. */
> if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> - move_early_exit_stmts (loop_vinfo);
> + {
> + vect_update_ivs_after_vectorizer_for_early_breaks (loop_vinfo);
> + move_early_exit_stmts (loop_vinfo);
> + }
>
> /* Remove existing clobber stmts and prefetches. */
> for (i = 0; i < nbbs; i++)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index
> aa6c3e2e041da26d2e5314b9edd0c5b934a9dd1d..c6b8e93c335616e4cab295a56ead5ca99bca8e82
> 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -5733,48 +5733,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned
> max_tree_size,
> "SLP build failed.\n");
> }
> }
> -
> - /* Find and create slp instances for inductions that have been forced
> - live due to early break. */
> - edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo));
> - for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo))
> - {
> - vec<stmt_vec_info> stmts;
> - vec<stmt_vec_info> roots = vNULL;
> - vec<tree> remain = vNULL;
> - gphi *phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info));
> - tree def = gimple_phi_arg_def_from_edge (phi, latch_e);
> - stmt_vec_info lc_info = loop_vinfo->lookup_def (def);
> - if (lc_info)
> - {
> - stmts.create (1);
> - stmts.quick_push (vect_stmt_to_vectorize (lc_info));
> - if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> - stmts, roots, remain,
> - max_tree_size, &limit,
> - bst_map, force_single_lane))
> - return opt_result::failure_at (vect_location,
> - "SLP build failed.\n");
> - }
> - /* When the latch def is from a different cycle this can only
> - be a induction. Build a simple instance for this.
> - ??? We should be able to start discovery from the PHI
> - for all inductions, but then there will be stray
> - non-SLP stmts we choke on as needing non-SLP handling. */
> - auto_vec<stmt_vec_info, 1> tem;
> - tem.quick_push (stmt_info);
> - if (!bst_map->get (tem))
> - {
> - stmts.create (1);
> - stmts.quick_push (stmt_info);
> - if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> - stmts, roots, remain,
> - max_tree_size, &limit,
> - bst_map, force_single_lane))
> - return opt_result::failure_at (vect_location,
> - "SLP build failed.\n");
> - }
> - }
> }
>
> hash_set<slp_tree> visited_patterns;
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index
> 83acbb3ff67ccdd4a39606850a23f483d6a4b1fb..eb46f1e86575dc72cead4b214af02a31c0d336ff
> 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -356,7 +356,6 @@ is_simple_and_all_uses_invariant (stmt_vec_info stmt_info,
> - it has uses outside the loop.
> - it has vdefs (it alters memory).
> - control stmts in the loop (except for the exit condition).
> - - it is an induction and we have multiple exits.
>
> CHECKME: what other side effects would the vectorizer allow? */
>
> @@ -420,9 +419,8 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
>
> /* Check if it's a not live PHI and multiple exits. In this case
> there will be a usage later on after peeling which is needed for the
> - alternate exit.
> - ??? Unless the PHI was marked live because of early
> - break, which also needs the latch def live and vectorized. */
> + alternate exit. Explicitly force it to be live but irrelevant for
> + vectorization, otherwise we will still analyze it for VF usage. */
> if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> && is_a <gphi *> (stmt)
> && gimple_bb (stmt) == LOOP_VINFO_LOOP (loop_vinfo)->header
> @@ -432,10 +430,10 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
> {
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> - "vec_stmt_relevant_p: PHI forced live for "
> - "early break.\n");
> - LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info);
> + "vec_stmt_relevant_p: PHI live but not relevant due to"
> + " early break.\n");
> *live_p = true;
> + return true;
> }
>
> if (*live_p && *relevant == vect_unused_in_scope
> @@ -12750,17 +12748,12 @@ can_vectorize_live_stmts (vec_info *vinfo,
> bool vec_stmt_p,
> stmt_vector_for_cost *cost_vec)
> {
> - loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
> stmt_vec_info slp_stmt_info;
> unsigned int i;
> FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (slp_node), i, slp_stmt_info)
> {
> if (slp_stmt_info
> - && (STMT_VINFO_LIVE_P (slp_stmt_info)
> - || (loop_vinfo
> - && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> - && STMT_VINFO_DEF_TYPE (slp_stmt_info)
> - == vect_induction_def))
> + && STMT_VINFO_LIVE_P (slp_stmt_info)
> && !vectorizable_live_operation (vinfo, slp_stmt_info, slp_node,
> slp_node_instance, i,
> vec_stmt_p, cost_vec))
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index
> 359c994139b211d3b4ac54d08890465f04ef107a..5362da5f2b40f556dd7d81b99a3b7b763f4ab296
> 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1233,6 +1233,10 @@ public:
> happen. */
> auto_vec<gimple*> early_break_vuses;
>
> + /* The IV adjustment value for inductions that needs to be materialized
> + inside the relevant exit blocks in order to adjust for early break. */
> + tree early_break_niters_var;
> +
> /* Record statements that are needed to be live for early break
> vectorization
> but may not have an LC PHI node materialized yet in the exits. */
> auto_vec<stmt_vec_info> early_break_live_ivs;
> @@ -1302,6 +1306,7 @@ public:
> (L)->early_break_live_ivs
> #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb
> #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses
> +#define LOOP_VINFO_EARLY_BRK_NITERS_VAR(L) (L)->early_break_niters_var
> #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds
> #define LOOP_VINFO_LOOP_IV_COND(L) (L)->loop_iv_cond
> #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
> @@ -2712,7 +2717,8 @@ extern tree cse_and_gimplify_to_preheader
> (loop_vec_info, tree);
>
> /* Nonlinear induction. */
> extern tree vect_peel_nonlinear_iv_init (gimple_seq*, tree, tree,
> - tree, enum vect_induction_op_type);
> + tree, enum vect_induction_op_type,
> + bool);
>
> /* In tree-vect-slp.cc. */
> extern void vect_slp_init (void);
>
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)