On Thu, 4 Apr 2024, Tamar Christina wrote:

> Hi All,
> 
> The report shows that we end up in a situation where the code has been peeled
> for gaps and we have an early break.
> 
> The code for peeling for gaps assume that a scalar loop needs to perform at
> least one iteration.  However this doesn't take into account early break where
> the scalar loop may not need to be executed.

But we always re-start the vector iteration where the early break happens?

> That the early break loop can be partial is not accounted for in this 
> scenario.
> loop partiality is normally handled by setting bias_for_lowest to 1, but when
> peeling for gaps we end up with 0, which when the loop upper bounds are
> calculated means that a partial loop iteration loses the final partial iter:
> 
> Analyzing # of iterations of loop 1
>   exit condition [8, + , 18446744073709551615] != 0
>   bounds on difference of bases: -8 ... -8
>   result:
>     # of iterations 8, bounded by 8
> 
> and a VF=4 calculating:
> 
> Loop 1 iterates at most 1 times.
> Loop 1 likely iterates at most 1 times.
> Analyzing # of iterations of loop 1
>   exit condition [1, + , 1](no_overflow) < bnd.5505_39
>   bounds on difference of bases: 0 ... 4611686018427387902
> Matching expression match.pd:2011, generic-match-8.cc:27
> Applying pattern match.pd:2067, generic-match-1.cc:4813
>   result:
>     # of iterations bnd.5505_39 + 18446744073709551615, bounded by 
> 4611686018427387902
> Estimating sizes for loop 1
> ...
>    Induction variable computation will be folded away.
>   size:   2 if (ivtmp_312 < bnd.5505_39)
>    Exit condition will be eliminated in last copy.
> size: 24-3, last_iteration: 24-5
>   Loop size: 24
>   Estimated size after unrolling: 26
> ;; Guessed iterations of loop 1 is 0.858446. New upper bound 1.
> 
> upper bound should be 2 not 1.

Why?  This means the vector loop will iterate once (thus the body
executed twice), isn't that correct?  Peeling for gaps means the
main IV will exit the loop in time.

> 
> This patch forced the bias_for_lowest to be 1 even when peeling for gaps.

(*)

> I have however not been able to write a standalone reproducer for this so I 
> have
> no tests but bootstrap and LLVM build fine now.
> 
> The testcase:
> 
> #define COUNT 9
> #define SIZE COUNT * 4
> #define TYPE unsigned long
> 
> TYPE x[SIZE], y[SIZE];
> 
> void __attribute__((noipa))
> loop (TYPE val)
> {
>   for (int i = 0; i < COUNT; ++i)
>     {
>       if (x[i * 4] > val || x[i * 4 + 1] > val)
>         return;
>       x[i * 4] = y[i * 2] + 1;
>       x[i * 4 + 1] = y[i * 2] + 2;
>       x[i * 4 + 2] = y[i * 2 + 1] + 3;
>       x[i * 4 + 3] = y[i * 2 + 1] + 4;
>     }
> }
> 
> does perform the peeling for gaps and early beak, however it creates a hybrid
> loop which works fine. adjusting the indices to non linear also works. So I'd
> like to submit the fix and work on a testcase separately if needed.

You can have peeling for gaps without SLP by doing interleaving.

#define COUNT 9
#define TYPE unsigned long

TYPE x[COUNT], y[COUNT*2];

void __attribute__((noipa))
loop (TYPE val)
{
  for (int i = 0; i < COUNT; ++i)
    { 
      if (x[i] > val)
        return;
      x[i] = y[i * 2];
   }
}

gets me partial vectors and peeling for gaps with -O3 -march=armv8.2-a+sve 
-fno-vect-cost-model (with cost modeling we choose ADVSIMD).  Does
this reproduce the issue?

Richard.


> Bootstrapped Regtested on x86_64-pc-linux-gnu no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       PR tree-optimization/114403
>       * tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when
>       peeling for gaps and early break.
> 
> ---
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 
> 4375ebdcb493a90fd0501cbb4b07466077b525c3..bf1bb9b005c68fbb13ee1b1279424865b237245a
>  100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -12139,7 +12139,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
> *loop_vectorized_call)
>    /* The minimum number of iterations performed by the epilogue.  This
>       is 1 when peeling for gaps because we always need a final scalar
>       iteration.  */
> -  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
> +  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +                        && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo) ? 1 : 0;

(*) This adjusts min_epilogue_iters though and honestly the whole code
looks like a mess.  I'm quoting a bit more here:

>    /* +1 to convert latch counts to loop iteration counts,
>       -min_epilogue_iters to remove iterations that cannot be performed
>         by the vector code.  */
  int bias_for_lowest = 1 - min_epilogue_iters;
  int bias_for_assumed = bias_for_lowest;

The variable names and comments now have nothing to do with the
actual magic we compute into them.

I think it would be an improvement to disentangle this a bit like
doing

   /* +1 to convert latch counts to loop iteration counts.  */
   int bias_for_lowest = 1;
   /* Comment, explain why peeling for gaps isn't relevant.  */
   if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     bias_for_lowest += ... ?
   /* When peeling for gaps we have at least one scalar iteration in
      the epilog.  */
   else if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
     bias_for_lowest -= 1;
   int bias_for_assumed = bias_for_lowest;

I'm still not convinced that you need to "ignore" peeling for gaps
when doing early exit vectorization?

Richard.

Reply via email to