On Mon, 10 Nov 2025, Tamar Christina wrote:
> First issue is that there's a latent bug exposed by this patch in that
> this example
>
> integer (8) b, c
> integer d
> c = 10
> d = 2
> call e ((/ (b, b = j, c, d) /), 0_8, c, d + 0_8)
> contains
> subroutine e (a, f, g, h)
> integer (8), dimension (:) :: a
> integer (8) f, g, h
> i = 1
> do b = f, g, h
> if (a (i) .ne. b) STOP
> i = i + 1
> end do
> if (size (a) .ne. i - 1) STOP 2
> end
> end
A variant with less UB that is also friendlier to the non-inlined case:
integer (8) b
integer c, d
c = 10
d = 2
call e ((/ (b, b = 0, c, d) /))
contains
  subroutine e (a)
    integer (8), contiguous, dimension (:) :: a
    integer i
    i = 1
    do b = 0, 10, 2
      if (a (i) .ne. b) STOP
      i = i + 1
    end do
    if (size (a) .ne. i - 1) STOP 2
  end
end
I can reproduce this on x86_64 with -O2 -msse4.2 -fno-vect-cost-model as well.
With -fallow-store-data-races it also fails without inlining; with
-fno-ipa-sra -fno-ipa-cp you also get close to the initial code for 'e'.
What I see is that on the path skipping the epilog loop from the
vector main IV exit we're referring to the original scalar IV for 'i',
which will likely keep it live, and that's the value we're using and
it looks correct to me ...
C testcase:
void __attribute__((noipa))
e (long *p, int n)
{
int i = 0;
for (long b = 0; b <= 5; b++)
{
if (p[i])
__builtin_abort ();
i++;
}
if (n != i - 1)
__builtin_abort ();
}
long a[6];
int main()
{
e (a, 5);
}
fails with -O2 -msse4.2 -fno-vect-cost-model.
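For reference, the scalar semantics the vectorized code has to preserve (my
own sanity-check sketch, not part of the reproducer): a[] is zero-initialized,
so the early break is never taken, i ends at 6 and the n != i - 1 guard must
not fire.

#include <assert.h>
long a[6];               /* zero-initialized, so the early break never triggers */
int main ()
{
  int i = 0;
  for (long b = 0; b <= 5; b++)
    {
      assert (!a[i]);    /* the early-break condition in e() never fires */
      i++;
    }
  assert (5 == i - 1);   /* the n != i - 1 guard in e() must not fire */
  return 0;
}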
Richard.
> has an execution failure because the wrong value is used for the split exit
> edge on the main exit (this is the same as the test array_constructor_12.f90
> in the testsuite).
>
> vect_update_ivs_after_vectorizer calculates that the loop iterated 7 times,
> while it actually would have done 6, and when we create the edge that skips
> the scalar loop in
>
> /* If we have a peeled vector iteration we will never skip the epilog loop
>    and we can simplify the cfg a lot by not doing the edge split. */
> if (skip_epilog
>     || (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
>         && !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo)))
>
> we don't copy the value from the scalar loop guard that
> vect_update_ivs_after_vectorizer
> calculated, so we further propagate the issue.
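> 
> As a concrete illustration of the off-by-one (my own arithmetic for the
> do b = f, g, h loop above, i.e. b = 0, 2, ..., 10, not compiler output):
> 
> #include <assert.h>
> int main ()
> {
>   long niters = (10 - 0) / 2 + 1;   /* 6 iterations of the do loop.  */
>   long i_correct = 1 + niters;      /* i after the loop: 7.  */
>   long i_wrong = 1 + (niters + 1);  /* advanced one step too far: 8.  */
>   assert (i_correct - 1 == 6);      /* size (a) .ne. i - 1 check passes.  */
>   assert (i_wrong - 1 == 7);        /* ... and here it fails -> STOP 2.  */
>   return 0;
> }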
>
> On x86_64 I also seem to get a miscompile of something during stage2, which
> causes the vectorizer to create invalid permutations when vectorizing the
> vect-simd-11.c and vect-simd-15.c cases.
>
> I'm not entirely sure whether the two are related, since I can't see
> anything else specifically wrong in this patch.
>
> So I'm sending this up since Richi wanted to take a look at the first bug
> above and to see if this approach is now in line with what was desired.
>
> Because we are now re-using vect_update_ivs_after_vectorizer we have an
> issue with UB clamping on non-linear inductions.
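> 
> To make the clamping issue concrete, here is a scalar sketch of the existing
> shr handling in vect_peel_nonlinear_iv_init (my own illustration, not part
> of this patch): peeling skip_niters iterations of an x >>= 1 induction means
> the peeled initial value is init >> skip_niters, which is UB once the shift
> amount reaches the type precision even though the original loop never shifts
> by that much at once, so the value has to be clamped:
> 
> #include <stdint.h>
> 
> static uint32_t
> peeled_shr_init (uint32_t init, unsigned skip_niters)
> {
>   /* Shift >= precision would be UB in C; semantically the value is 0 by
>      then (the existing comment notes init >> (prec - 1) for ashr).  */
>   if (skip_niters >= 32)
>     return 0;
>   return init >> skip_niters;
> }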
>
> At the moment when doing early exit updating I just ignore the possibility
> of UB since if the main exit is OK, the early exit is one iteration behind
> the main one and so should be ok.
>
> Things however get complicated with PEELED loops.
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR tree-optimization/115120
> PR tree-optimization/119577
> PR tree-optimization/119860
> * tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for nonlinear
> mult induction and early break.
> (vect_update_ivs_after_vectorizer): Support early break exits.
> (vect_do_peeling): Support scalar IVs.
> * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support early break.
> (vect_update_nonlinear_iv): Use unsigned_type_for so that the function
> works for both vector and scalar types.
> (vectorizable_induction, vectorizable_live_operation): Remove vector
> early break IV code.
> (vect_update_ivs_after_vectorizer_for_early_breaks): New.
> (vect_transform_loop): Support new scalar IV for early break.
> * tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for early break
> IVs.
> * tree-vect-stmts.cc (vect_stmt_relevant_p): Mark early break IVs as live
> but not relevant rather than used_only_live. They no longer
> contribute to the vector loop and so should not be analyzed.
> (can_vectorize_live_stmts): Remove vector early break IV code.
> * tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
> (class loop_vec_info): Add early_break_niters_var.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/115120
> PR tree-optimization/119577
> PR tree-optimization/119860
> * gcc.dg/vect/vect-early-break_39.c: Update.
> * gcc.target/aarch64/sve/peel_ind_10.c: Update.
> * gcc.target/aarch64/sve/peel_ind_11.c: Update.
> * gcc.target/aarch64/sve/peel_ind_12.c: Update.
> * gcc.target/aarch64/sve/peel_ind_5.c: Update.
> * gcc.target/aarch64/sve/peel_ind_6.c: Update.
> * gcc.target/aarch64/sve/peel_ind_7.c: Update.
> * gcc.target/aarch64/sve/peel_ind_9.c: Update.
>
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> index
> b3f40b8c9ba49e41bd283e46a462238c3b5825ef..bc862ad20e68db8f3c0ba6facf47e13a56a7cd6d
> 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> @@ -23,5 +23,6 @@ unsigned test4(unsigned x, unsigned n)
> return ret;
> }
>
> -/* cannot safely vectorize this due due to the group misalignment. */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0
> "vect" } } */
> +/* AArch64 will scalarize the load and is able to vectorize it. */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1
> "vect" { target aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0
> "vect" { target { ! aarch64*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> index
> b7a7bc5cb0cfdfdb74adb120c54ba15019832cf1..43abd01c078da7d3f80045ecbd37b72ac918f678
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_10.c
> @@ -20,5 +20,4 @@ foo (int start)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> index
> feb7ee7d61c92145e8defc095f2ad096b1e3f777..37806adea7b9788d3122fa32148a8709d5cf57be
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_11.c
> @@ -15,6 +15,5 @@ foo (int *a) {
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> index
> 260482a94df750b7886d72eed1964e70288c0886..e3ed63afb05cbef15d3c58a18acb0f3650161223
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_12.c
> @@ -15,7 +15,6 @@ foo (int *restrict a, int * restrict b) {
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Both peeling and versioning will be applied"
> "vect" } } */
> /* { dg-final { scan-assembler {\tnot\tp[0-7]\.b, p[0-7]/z, p.*\n} } } */
> /* { dg-final { scan-assembler {\teor\t.*\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> index
> a03bb1dec21ef75aa0cbfb22c8bb02b99644239e..1977bf3af2db247825900c4200676f4dc2ca4f9a
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_5.c
> @@ -20,5 +20,4 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> index
> 9bfd1a65c4feb0c140d4abf98508fc8af08042ba..0b40d26ae2a3f3c882a7e571140f9efabcf9c41a
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_6.c
> @@ -20,5 +20,4 @@ foo (int start)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> index
> 0182e131a173b7b05e88c3393ba854b2da25c6b2..7a24d689e95a65aa65e1ec6558d117d19407a2c6
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_7.c
> @@ -20,5 +20,4 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> index
> cc904e88170f072e1d3c6be86643d99a7cd5cb12..136d18c2ea89f5a93a1edfc24fe8b7f97bae82d8
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> @@ -20,6 +20,6 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* Peels using a scalar loop. */
> -/* { dg-final { scan-tree-dump-not "pfa_iv_offset" "vect" } } */
> +/* Peels using fully masked loop. */
> +/* { dg-final { scan-tree-dump "misalignment for fully-masked loop" "vect" }
> } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> index
> 85aab355f95f83e1fa65d280f14fb8ade7f7e658..15dd2d8f45fabcd9b0ec4ef5f4dc83b2b0692822
> 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr119351.c
> @@ -34,6 +34,5 @@ foo (void)
> }
>
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> /* { dg-final { scan-tree-dump "Alignment of access forced using peeling"
> "vect" } } */
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index
> 6af07efe68ac5e52e76d0c96f94886774cd1dd40..f645d446e3d2a8db4f2583391c7e6016bf665d1e
> 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -2159,6 +2159,16 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
> return false;
> }
>
> + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> + && induction_type == vect_step_op_mul)
> + {
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "Peeling for is not supported for nonlinear mult"
> + " induction using partial vectorization.\n");
> + return false;
> + }
> +
> /* Avoid compile time hog on vect_peel_nonlinear_iv_init. */
> if (induction_type == vect_step_op_mul)
> {
> @@ -2313,6 +2323,9 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
> The phi args associated with the edge UPDATE_E in the bb
> UPDATE_E->dest are updated accordingly.
>
> + - EARLY_EXIT_P - Indicates whether the exit is an early exit rather than
> + the main latch exit.
> +
> Assumption 1: Like the rest of the vectorizer, this function assumes
> a single loop exit that has a single predecessor.
>
> @@ -2331,7 +2344,8 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
>
> static void
> vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
> - tree niters, edge update_e)
> + tree niters, edge update_e,
> + bool early_exit_p)
> {
> gphi_iterator gsi, gsi1;
> class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> @@ -2398,7 +2412,7 @@ vect_update_ivs_after_vectorizer (loop_vec_info
> loop_vinfo,
> else
> ni = vect_peel_nonlinear_iv_init (&stmts, init_expr,
> niters, step_expr,
> - induction_type);
> + induction_type, early_exit_p);
>
> var = create_tmp_var (type, "tmp");
>
> @@ -3568,9 +3582,36 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
> and so the main exit needs to be treated the same as the alternative
> exits in that we leave their updates to vectorizable_live_operations.
> */
> - if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> - vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> - update_e);
> + tree vector_iters_vf = niters_vector_mult_vf;
> + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> + {
> + tree scal_iv_ty = signed_type_for (TREE_TYPE (vector_iters_vf));
> + tree tmp_niters_vf = make_ssa_name (scal_iv_ty);
> + basic_block exit_bb = NULL;
> + edge update_e = NULL;
> +
> + /* Identify the early exit merge block. I wish we had stored this.
> */
> + for (auto e : get_loop_exit_edges (loop))
> + if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
> + {
> + exit_bb = e->dest;
> + update_e = single_succ_edge (exit_bb);
> + break;
> + }
> + vect_update_ivs_after_vectorizer (loop_vinfo, tmp_niters_vf,
> + update_e, true);
> +
> + if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> + vector_iters_vf = tmp_niters_vf;
> +
> + LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo) = tmp_niters_vf;
> + }
> +
> + bool recalculate_peel_niters_init
> + = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
> + vect_update_ivs_after_vectorizer (loop_vinfo, vector_iters_vf,
> + update_e,
> + recalculate_peel_niters_init);
>
> /* If we have a peeled vector iteration we will never skip the epilog
> loop
> and we can simplify the cfg a lot by not doing the edge split. */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index
> 50cdc2a90fa29c1e0d116c0589bc246e6d8fcc84..91a1833011fc867c21d4018a90f27d1bf9283ca0
> 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -8934,14 +8934,25 @@ vect_create_nonlinear_iv_init (gimple_seq* stmts,
> tree init_expr,
> tree
> vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
> tree skip_niters, tree step_expr,
> - enum vect_induction_op_type induction_type)
> + enum vect_induction_op_type induction_type,
> + bool early_exit_p)
> {
> - gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
> + gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST || early_exit_p);
> tree type = TREE_TYPE (init_expr);
> unsigned prec = TYPE_PRECISION (type);
> switch (induction_type)
> {
> + /* neg inductions are typically not used for loop termination conditions
> but
> + are typically implemented as b = -b. That is every scalar iteration
> b is
> + negated. That means that for the initial value of b we will have to
> + determine whether the number of skipped iterations is a multiple of 2
> + because every 2 scalar iterations we are back at "b". */
> case vect_step_op_neg:
> + /* For early exits the neg induction will always be the same value at
> the
> + start of the iteration. */
> + if (early_exit_p)
> + break;
> +
> if (TREE_INT_CST_LOW (skip_niters) % 2)
> init_expr = gimple_build (stmts, NEGATE_EXPR, type, init_expr);
> /* else no change. */
> @@ -8949,13 +8960,15 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree
> init_expr,
>
> case vect_step_op_shr:
> case vect_step_op_shl:
> - skip_niters = gimple_convert (stmts, type, skip_niters);
> - step_expr = gimple_build (stmts, MULT_EXPR, type, step_expr,
> skip_niters);
> + skip_niters = fold_build1 (NOP_EXPR, type, skip_niters);
> + step_expr = fold_build1 (NOP_EXPR, type, step_expr);
> + step_expr = fold_build2 (MULT_EXPR, type, step_expr, skip_niters);
> /* When shift mount >= precision, need to avoid UD.
> In the original loop, there's no UD, and according to semantic,
> init_expr should be 0 for lshr, ashl, and >>= (prec - 1) for ashr. */
> - if (!tree_fits_uhwi_p (step_expr)
> + if ((!tree_fits_uhwi_p (step_expr)
> || tree_to_uhwi (step_expr) >= prec)
> + && !early_exit_p)
> {
> if (induction_type == vect_step_op_shl
> || TYPE_UNSIGNED (type))
> @@ -8966,13 +8979,19 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree
> init_expr,
> wide_int_to_tree (type, prec - 1));
> }
> else
> - init_expr = gimple_build (stmts, (induction_type == vect_step_op_shr
> + {
> + init_expr = fold_build2 ((induction_type == vect_step_op_shr
> ? RSHIFT_EXPR : LSHIFT_EXPR),
> - type, init_expr, step_expr);
> + type, init_expr, step_expr);
> + init_expr = force_gimple_operand (init_expr, stmts, false, NULL);
> + }
> break;
>
> case vect_step_op_mul:
> {
> + /* Due to UB we can't support vect_step_op_mul with early break for now,
> + so assert and block. */
> + gcc_assert (TREE_CODE (skip_niters) == INTEGER_CST);
> tree utype = unsigned_type_for (type);
> init_expr = gimple_convert (stmts, utype, init_expr);
> wide_int skipn = wi::to_wide (skip_niters);
> @@ -9056,9 +9075,7 @@ vect_update_nonlinear_iv (gimple_seq* stmts, tree
> vectype,
> case vect_step_op_mul:
> {
> /* Use unsigned mult to avoid UD integer overflow. */
> - tree uvectype
> - = build_vector_type (unsigned_type_for (TREE_TYPE (vectype)),
> - TYPE_VECTOR_SUBPARTS (vectype));
> + tree uvectype = unsigned_type_for (vectype);
> vec_def = gimple_convert (stmts, uvectype, vec_def);
> vec_step = gimple_convert (stmts, uvectype, vec_step);
> vec_def = gimple_build (stmts, MULT_EXPR, uvectype,
> @@ -9305,7 +9322,7 @@ vectorizable_nonlinear_induction (loop_vec_info
> loop_vinfo,
> to adjust the start value here. */
> if (niters_skip != NULL_TREE)
> init_expr = vect_peel_nonlinear_iv_init (&stmts, init_expr, niters_skip,
> - step_expr, induction_type);
> + step_expr, induction_type, false);
>
> vec_init = vect_create_nonlinear_iv_init (&stmts, init_expr,
> step_expr, nunits, vectype,
> @@ -9686,53 +9703,6 @@ vectorizable_induction (loop_vec_info loop_vinfo,
> LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo));
> peel_mul = gimple_build_vector_from_val (&init_stmts,
> step_vectype, peel_mul);
> -
> - /* If early break then we have to create a new PHI which we can use as
> - an offset to adjust the induction reduction in early exits.
> -
> - This is because when peeling for alignment using masking, the first
> - few elements of the vector can be inactive. As such if we find the
> - entry in the first iteration we have adjust the starting point of
> - the scalar code.
> -
> - We do this by creating a new scalar PHI that keeps track of whether
> - we are the first iteration of the loop (with the additional masking)
> - or whether we have taken a loop iteration already.
> -
> - The generated sequence:
> -
> - pre-header:
> - bb1:
> - i_1 = <number of leading inactive elements>
> -
> - header:
> - bb2:
> - i_2 = PHI <i_1(bb1), 0(latch)>
> - …
> -
> - early-exit:
> - bb3:
> - i_3 = iv_step * i_2 + PHI<vector-iv>
> -
> - The first part of the adjustment to create i_1 and i_2 are done here
> - and the last part creating i_3 is done in
> - vectorizable_live_operations when the induction extraction is
> - materialized. */
> - if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> - && !LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> - {
> - auto skip_niters = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> - tree ty_skip_niters = TREE_TYPE (skip_niters);
> - tree break_lhs_phi = vect_get_new_vect_var (ty_skip_niters,
> - vect_scalar_var,
> - "pfa_iv_offset");
> - gphi *nphi = create_phi_node (break_lhs_phi, bb);
> - add_phi_arg (nphi, skip_niters, pe, UNKNOWN_LOCATION);
> - add_phi_arg (nphi, build_zero_cst (ty_skip_niters),
> - loop_latch_edge (iv_loop), UNKNOWN_LOCATION);
> -
> - LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo) = PHI_RESULT (nphi);
> - }
> }
> tree step_mul = NULL_TREE;
> unsigned ivn;
> @@ -10308,8 +10278,7 @@ vectorizable_live_operation (vec_info *vinfo,
> stmt_vec_info stmt_info,
> to the latch then we're restarting the iteration in the
> scalar loop. So get the first live value. */
> bool early_break_first_element_p
> - = (all_exits_as_early_p || !main_exit_edge)
> - && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def;
> + = all_exits_as_early_p || !main_exit_edge;
> if (early_break_first_element_p)
> {
> tmp_vec_lhs = vec_lhs0;
> @@ -10318,52 +10287,13 @@ vectorizable_live_operation (vec_info *vinfo,
> stmt_vec_info stmt_info,
>
> gimple_stmt_iterator exit_gsi;
> tree new_tree
> - = vectorizable_live_operation_1 (loop_vinfo,
> - e->dest, vectype,
> - slp_node, bitsize,
> - tmp_bitstart, tmp_vec_lhs,
> - lhs_type, &exit_gsi);
> + = vectorizable_live_operation_1 (loop_vinfo,
> + e->dest, vectype,
> + slp_node, bitsize,
> + tmp_bitstart, tmp_vec_lhs,
> + lhs_type, &exit_gsi);
>
> auto gsi = gsi_for_stmt (use_stmt);
> - if (early_break_first_element_p
> - && LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> - {
> - tree step_expr
> - = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
> - tree break_lhs_phi
> - = LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo);
> - tree ty_skip_niters = TREE_TYPE (break_lhs_phi);
> - gimple_seq iv_stmts = NULL;
> -
> - /* Now create the PHI for the outside loop usage to
> - retrieve the value for the offset counter. */
> - tree rphi_step
> - = gimple_convert (&iv_stmts, ty_skip_niters, step_expr);
> - tree tmp2
> - = gimple_build (&iv_stmts, MULT_EXPR,
> - ty_skip_niters, rphi_step,
> - break_lhs_phi);
> -
> - if (POINTER_TYPE_P (TREE_TYPE (new_tree)))
> - {
> - tmp2 = gimple_convert (&iv_stmts, sizetype, tmp2);
> - tmp2 = gimple_build (&iv_stmts, POINTER_PLUS_EXPR,
> - TREE_TYPE (new_tree), new_tree,
> - tmp2);
> - }
> - else
> - {
> - tmp2 = gimple_convert (&iv_stmts, TREE_TYPE (new_tree),
> - tmp2);
> - tmp2 = gimple_build (&iv_stmts, PLUS_EXPR,
> - TREE_TYPE (new_tree), new_tree,
> - tmp2);
> - }
> -
> - new_tree = tmp2;
> - gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> - }
> -
> tree lhs_phi = gimple_phi_result (use_stmt);
> remove_phi_node (&gsi, false);
> gimple *copy = gimple_build_assign (lhs_phi, new_tree);
> @@ -11004,6 +10934,105 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
> SET_PHI_ARG_DEF_ON_EDGE (phi, e, last_seen_vuse);
> }
>
> +/* Generate adjustment code for early break scalar IVs filling in the value
> + we created earlier on for LOOP_VINFO_EARLY_BRK_NITERS_VAR. */
> +
> +static void
> +vect_update_ivs_after_vectorizer_for_early_breaks (loop_vec_info loop_vinfo)
> +{
> + DUMP_VECT_SCOPE ("vect_update_ivs_after_vectorizer_for_early_breaks");
> +
> + if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> + return;
> +
> + gcc_assert (LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo));
> +
> + tree phi_var = LOOP_VINFO_EARLY_BRK_NITERS_VAR (loop_vinfo);
> + tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> + tree ty_var = TREE_TYPE (phi_var);
> + auto loop = LOOP_VINFO_LOOP (loop_vinfo);
> + tree induc_var = niters_skip ? copy_ssa_name (phi_var) : phi_var;
> +
> + auto induction_phi = create_phi_node (induc_var, loop->header);
> + tree induc_def = PHI_RESULT (induction_phi);
> +
> + /* Create the iv update inside the loop. */
> + gimple_seq init_stmts = NULL;
> + gimple_seq stmts = NULL;
> + gimple_seq iv_stmts = NULL;
> + tree tree_vf = build_int_cst (ty_var, vf);
> +
> + /* For loop len targets we have to use .SELECT_VL (ivtmp_33, VF); instead
> of
> + just += VF as the VF can change in between two loop iterations. */
> + if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
> + {
> + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> + tree_vf = vect_get_loop_len (loop_vinfo, NULL, lens, 1,
> + NULL_TREE, 0, 0);
> + }
> +
> + tree iter_var;
> + if (SCALAR_FLOAT_TYPE_P (ty_var))
> + tree_vf = gimple_convert (&init_stmts, ty_var, tree_vf);
> +
> + if (POINTER_TYPE_P (ty_var))
> + {
> + tree offset = gimple_convert (&stmts, sizetype, tree_vf);
> + iter_var = gimple_build (&stmts, POINTER_PLUS_EXPR, ty_var, induc_def,
> + gimple_convert (&stmts, sizetype, offset));
> + }
> + else
> + {
> + tree offset = gimple_convert (&stmts, ty_var, tree_vf);
> + iter_var = gimple_build (&stmts, PLUS_EXPR, ty_var, induc_def, offset);
> + }
> +
> + tree init_var = build_zero_cst (ty_var);
> + if (niters_skip)
> + init_var = gimple_build (&init_stmts, MINUS_EXPR, ty_var, init_var,
> + gimple_convert (&init_stmts, ty_var, niters_skip));
> +
> + add_phi_arg (induction_phi, iter_var,
> + loop_latch_edge (loop), UNKNOWN_LOCATION);
> + add_phi_arg (induction_phi, init_var,
> + loop_preheader_edge (loop), UNKNOWN_LOCATION);
> +
> + /* Find the first insertion point in the BB. */
> + auto pe = loop_preheader_edge (loop);
> +
> + /* If we've done any peeling, calculate the peeling adjustment needed to
> the
> + final IV. */
> + if (niters_skip)
> + {
> + induc_def = gimple_build (&iv_stmts, MAX_EXPR, TREE_TYPE (induc_def),
> + induc_def,
> + build_zero_cst (TREE_TYPE (induc_def)));
> + auto stmt = gimple_build_assign (phi_var, induc_def);
> + gimple_seq_add_stmt_without_update (&iv_stmts, stmt);
> + basic_block exit_bb = NULL;
> + /* Identify the early exit merge block. I wish we had stored this. */
> + for (auto e : get_loop_exit_edges (loop))
> + if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
> + {
> + exit_bb = e->dest;
> + break;
> + }
> +
> + gcc_assert (exit_bb);
> + auto exit_gsi = gsi_after_labels (exit_bb);
> + gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> + }
> +
> + /* Write the init_stmts in the loop-preheader block. */
> + auto psi = gsi_last_nondebug_bb (pe->src);
> + gsi_insert_seq_after (&psi, init_stmts, GSI_LAST_NEW_STMT);
> + /* Write the adjustments in the header block. */
> + basic_block bb = loop->header;
> + auto si = gsi_after_labels (bb);
> + gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
> +}
> +
> /* Function vect_transform_loop.
>
> The analysis phase has determined that the loop is vectorizable.
> @@ -11148,7 +11177,10 @@ vect_transform_loop (loop_vec_info loop_vinfo,
> gimple *loop_vectorized_call)
> /* Handle any code motion that we need to for early-break vectorization
> after
> we've done peeling but just before we start vectorizing. */
> if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> - move_early_exit_stmts (loop_vinfo);
> + {
> + vect_update_ivs_after_vectorizer_for_early_breaks (loop_vinfo);
> + move_early_exit_stmts (loop_vinfo);
> + }
>
> /* Remove existing clobber stmts and prefetches. */
> for (i = 0; i < nbbs; i++)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index
> aa6c3e2e041da26d2e5314b9edd0c5b934a9dd1d..c6b8e93c335616e4cab295a56ead5ca99bca8e82
> 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -5733,48 +5733,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned
> max_tree_size,
> "SLP build failed.\n");
> }
> }
> -
> - /* Find and create slp instances for inductions that have been forced
> - live due to early break. */
> - edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo));
> - for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo))
> - {
> - vec<stmt_vec_info> stmts;
> - vec<stmt_vec_info> roots = vNULL;
> - vec<tree> remain = vNULL;
> - gphi *phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info));
> - tree def = gimple_phi_arg_def_from_edge (phi, latch_e);
> - stmt_vec_info lc_info = loop_vinfo->lookup_def (def);
> - if (lc_info)
> - {
> - stmts.create (1);
> - stmts.quick_push (vect_stmt_to_vectorize (lc_info));
> - if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> - stmts, roots, remain,
> - max_tree_size, &limit,
> - bst_map, force_single_lane))
> - return opt_result::failure_at (vect_location,
> - "SLP build failed.\n");
> - }
> - /* When the latch def is from a different cycle this can only
> - be a induction. Build a simple instance for this.
> - ??? We should be able to start discovery from the PHI
> - for all inductions, but then there will be stray
> - non-SLP stmts we choke on as needing non-SLP handling. */
> - auto_vec<stmt_vec_info, 1> tem;
> - tem.quick_push (stmt_info);
> - if (!bst_map->get (tem))
> - {
> - stmts.create (1);
> - stmts.quick_push (stmt_info);
> - if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> - stmts, roots, remain,
> - max_tree_size, &limit,
> - bst_map, force_single_lane))
> - return opt_result::failure_at (vect_location,
> - "SLP build failed.\n");
> - }
> - }
> }
>
> hash_set<slp_tree> visited_patterns;
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index
> 83acbb3ff67ccdd4a39606850a23f483d6a4b1fb..eb46f1e86575dc72cead4b214af02a31c0d336ff
> 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -356,7 +356,6 @@ is_simple_and_all_uses_invariant (stmt_vec_info stmt_info,
> - it has uses outside the loop.
> - it has vdefs (it alters memory).
> - control stmts in the loop (except for the exit condition).
> - - it is an induction and we have multiple exits.
>
> CHECKME: what other side effects would the vectorizer allow? */
>
> @@ -420,9 +419,8 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
>
> /* Check if it's a not live PHI and multiple exits. In this case
> there will be a usage later on after peeling which is needed for the
> - alternate exit.
> - ??? Unless the PHI was marked live because of early
> - break, which also needs the latch def live and vectorized. */
> + alternate exit. Explicitly force it to be live but irrelevant for
> + vectorization, otherwise we will still analyze it for VF usage. */
> if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> && is_a <gphi *> (stmt)
> && gimple_bb (stmt) == LOOP_VINFO_LOOP (loop_vinfo)->header
> @@ -432,10 +430,10 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
> {
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> - "vec_stmt_relevant_p: PHI forced live for "
> - "early break.\n");
> - LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info);
> + "vec_stmt_relevant_p: PHI live but not relevant due to"
> + " early break.\n");
> *live_p = true;
> + return true;
> }
>
> if (*live_p && *relevant == vect_unused_in_scope
> @@ -12750,17 +12748,12 @@ can_vectorize_live_stmts (vec_info *vinfo,
> bool vec_stmt_p,
> stmt_vector_for_cost *cost_vec)
> {
> - loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
> stmt_vec_info slp_stmt_info;
> unsigned int i;
> FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (slp_node), i, slp_stmt_info)
> {
> if (slp_stmt_info
> - && (STMT_VINFO_LIVE_P (slp_stmt_info)
> - || (loop_vinfo
> - && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> - && STMT_VINFO_DEF_TYPE (slp_stmt_info)
> - == vect_induction_def))
> + && STMT_VINFO_LIVE_P (slp_stmt_info)
> && !vectorizable_live_operation (vinfo, slp_stmt_info, slp_node,
> slp_node_instance, i,
> vec_stmt_p, cost_vec))
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index
> 359c994139b211d3b4ac54d08890465f04ef107a..5362da5f2b40f556dd7d81b99a3b7b763f4ab296
> 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1233,6 +1233,10 @@ public:
> happen. */
> auto_vec<gimple*> early_break_vuses;
>
> + /* The IV adjustment value for inductions that needs to be materialized
> + inside the relevant exit blocks in order to adjust for early break. */
> + tree early_break_niters_var;
> +
> /* Record statements that are needed to be live for early break
> vectorization
> but may not have an LC PHI node materialized yet in the exits. */
> auto_vec<stmt_vec_info> early_break_live_ivs;
> @@ -1302,6 +1306,7 @@ public:
> (L)->early_break_live_ivs
> #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb
> #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses
> +#define LOOP_VINFO_EARLY_BRK_NITERS_VAR(L) (L)->early_break_niters_var
> #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds
> #define LOOP_VINFO_LOOP_IV_COND(L) (L)->loop_iv_cond
> #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
> @@ -2712,7 +2717,8 @@ extern tree cse_and_gimplify_to_preheader
> (loop_vec_info, tree);
>
> /* Nonlinear induction. */
> extern tree vect_peel_nonlinear_iv_init (gimple_seq*, tree, tree,
> - tree, enum vect_induction_op_type);
> + tree, enum vect_induction_op_type,
> + bool);
>
> /* In tree-vect-slp.cc. */
> extern void vect_slp_init (void);
>
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)