Richard Biener <richard.guent...@gmail.com> writes:
> On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.ch...@gmail.com> wrote:
>>
>> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
>> <richard.guent...@gmail.com> wrote:
>> >
>> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
>> > <richard.guent...@gmail.com> wrote:
>> > >
>> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
>> > > > <richard.guent...@gmail.com> wrote:
>> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> 
>> > > >> wrote:
>> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> 
>> > > >>> wrote:
>> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>> > > >>>> <richard.guent...@gmail.com> wrote:
>> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> 
>> > > >>>>> wrote:
>> > > >>>>>> Hi,
>> > > >>>>>> As explained in the PR, given the below test case:
>> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
>> > > >>>>>>
>> > > >>>>>> int
>> > > >>>>>> main ()
>> > > >>>>>> {
>> > > >>>>>>   short b;
>> > > >>>>>>   int i, d;
>> > > >>>>>>   for (b = 4; b >= 0; b--)
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
>> > > >>>>>>   for (i = 0; i < 8; i++)
>> > > >>>>>>     for (d = 0; d < 10; d++)
>> > > >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>> > > >>>>>>         __builtin_abort ();
>> > > >>>>>>   return 0;
>> > > >>>>>> }
>> > > >>>>>>
>> > > >>>>>> the loop nest is illegal for vectorization without reversing the
>> > > >>>>>> inner loop.  The issue is in the data dependence checking of the
>> > > >>>>>> vectorizer; I believe the mentioned revision just exposed it.
>> > > >>>>>> Previously, vectorization was skipped because of an unsupported
>> > > >>>>>> memory operation.  The outer loop vectorization unrolls the outer
>> > > >>>>>> loop into:
>> > > >>>>>>
>> > > >>>>>>   for (b = 4; b > 0; b -= 4)
>> > > >>>>>>   {
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][6] = a[c][5];
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][5] = a[c][4];
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][4] = a[c][3];
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>       a[c + 1][3] = a[c][2];
>> > > >>>>>>   }
>> > > >>>>>> Then four inner loops are fused into:
>> > > >>>>>>   for (b = 4; b > 0; b -= 4)
>> > > >>>>>>   {
>> > > >>>>>>     for (c = 0; c <= 6; c++)
>> > > >>>>>>     {
>> > > >>>>>>       a[c + 1][6] = a[c][5];  // S1
>> > > >>>>>>       a[c + 1][5] = a[c][4];  // S2
>> > > >>>>>>       a[c + 1][4] = a[c][3];
>> > > >>>>>>       a[c + 1][3] = a[c][2];
>> > > >>>>>>     }
>> > > >>>>>>   }
>> > > >>>>>
>> > > >>>>> Note that they are not really "fused" but interleaved.  With GIMPLE
>> > > >>>>> in mind that makes a difference; you should get the equivalent of
>> > > >>>>>    for (c = 0; c <= 6; c++)
>> > > >>>>>      {
>> > > >>>>>        tem1 = a[c][5];
>> > > >>>>>        tem2 = a[c][4];
>> > > >>>>>        tem3 = a[c][3];
>> > > >>>>>        tem4 = a[c][2];
>> > > >>>>>        a[c + 1][6] = tem1;
>> > > >>>>>        a[c + 1][5] = tem2;
>> > > >>>>>        a[c + 1][4] = tem3;
>> > > >>>>>        a[c + 1][3] = tem4;
>> > > >>>>>      }
>> > > >>>> Yeah, I will double check if this abstraction breaks the patch and how.
>> > > >>> Hmm, I think this doesn't break it, at least for part of the
>> > > >>> analysis, because it is the loop carried (backward) dependence that
>> > > >>> goes wrong; interleaving or not within the same iteration doesn't
>> > > >>> matter here.
>> > > >>
>> > > >> I think the idea is that forward dependences (negative distance) are
>> > > >> always fine to vectorize.  But with backward dependences we have to
>> > > >> adhere to max_vf.
>> > > >>
>> > > >> It looks like for outer loop vectorization we only look at the
>> > > >> distances in the outer loop but never at the inner ones.  But here
>> > > >> the same applies, and isn't that independent of the distances with
>> > > >> respect to the outer loop?
>> > > >>
>> > > >> But maybe I'm misunderstanding how "distances" work here.
>> > > > Hmm, I am not sure I understand "distance" correctly.  With the
>> > > > description as in books like "Optimizing Compilers for Modern
>> > > > Architectures", the distance is "# of iteration of sink ref - # of
>> > > > iteration of source ref".  Given the below example:
>> > > >   for (i = 0; i < N; ++i)
>> > > >     {
>> > > >       x = arr[idx_1];  // S1
>> > > >       arr[idx_2] = x;  // S2
>> > > >     }
>> > > > If S1 is the source ref, distance = idx_1 - idx_2, and distance > 0.
>> > > > This is also a forward dependence.  For example, idx_1 is i + 1 and
>> > > > idx_2 is i.
>> > > > If S2 is the source ref, distance = idx_1 - idx_2, and distance < 0.
>> > > > This is also a backward dependence.  For example, idx_1 is i and
>> > > > idx_2 is i + 1.
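To make those two cases concrete, this is how I read them, written out as
standalone loops (the wrappers and names below are just for illustration,
not from the testcase):

  #define N 16
  static int arr[N + 1];

  static void
  forward_case (void)
  {
    int i, x;
    /* idx_1 = i + 1, idx_2 = i: iteration i reads arr[i + 1] before
       iteration i + 1 writes it, so the source is S1 and the carried
       distance is +1 (a forward dependence).  */
    for (i = 0; i < N; i++)
      {
        x = arr[i + 1];  /* S1 */
        arr[i] = x;      /* S2 */
      }
  }

  static void
  backward_case (void)
  {
    int i, x;
    /* idx_1 = i, idx_2 = i + 1: iteration i writes arr[i + 1] before
       iteration i + 1 reads it, so the source is S2; subtracting idx_2
       from idx_1 gives -1, the backward case that gets reversed at the
       carried level.  */
    for (i = 0; i < N; i++)
      {
        x = arr[i];      /* S1 */
        arr[i + 1] = x;  /* S2 */
      }
  }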
>> > > >
>> > > > In GCC, we always try to subtract idx_2 from idx_1 first when
>> > > > computing the classic distance, which can result in a negative
>> > > > distance in the case of a backward dependence.  When this happens at
>> > > > the level carrying the dependence, we manually reverse it.  When it
>> > > > happens at an inner loop level, we simply keep the negative distance.
>> > > > And it's meaningless to talk about forward/backward when the
>> > > > dependence is carried at the outer loop level.
>> > > >
>> > > > Outer loop vectorization is interesting.  The problematic case has a
>> > > > backward dependence carried by the outer loop.  Because we don't check
>> > > > dist vs. max_vf for such deps, the unrolled references could have
>> > > > equal outer loop indices, as in a[c][5] and a[c+1][5].  So it's as if
>> > > > the outer loop index were resolved as equal.  Now there is a
>> > > > dependence only if c == c' + 1.  I think the previous comment on
>> > > > fusion still holds for this, and we now need to check the backward
>> > > > dependence between the two references at the inner loop level.  I
>> > > > guess it's a bit tricky to model the dependence between a[c][5] and
>> > > > a[c+1][5] using the original references and dependence.  I think it's
>> > > > okay to do that because the distance of a[c][5] and a[c+1][5] is what
>> > > > we computed previously for the original references at the inner loop
>> > > > level.
>> > > >
>> > > > Not sure if I have missed something important.
>> > >
>> > > Not sure either.  unroll-and-jam does
>> > >
>> > >       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>> > >         {
>> > >           /* A distance (a,b) is at worst transformed into (a/N,b) by the
>> > >              unrolling (factor N), so the transformation is valid if
>> > >              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the
>> > >              unroll factor needs to be limited so that the first
>> > >              condition holds.  That may limit the factor down to zero
>> > >              in the worst case.  */
>> > >           int dist = dist_v[0];
>> > >           if (dist < 0)
>> > >             gcc_unreachable ();
>> > >           else if ((unsigned)dist >= *unroll)
>> > >             ;
>> > >           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>> > >                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
>> > >                        && dist > 0))
>> > >             ;
>> > >           else
>> > >             *unroll = dist;
>> > >
>> > > where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost 
>> > > loop.
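Restating that test for just a two-component distance vector (a, b), where
a is the component of the loop whose unroll factor N is being constrained,
a minimal standalone sketch (not the GCC code itself):

  /* Return the unroll factor that is still valid for a dependence with
     distance (a, b), starting from UNROLL.  Mirrors the quoted comment:
     (a, b) becomes at worst (a/N, b), which is fine if a >= N, or b > 0,
     or b == 0 && a > 0; otherwise N has to be limited to a, possibly all
     the way down to zero.  (a < 0 is not expected here; the quoted code
     has gcc_unreachable for it.)  */
  static unsigned
  limit_unroll (int a, int b, unsigned unroll)
  {
    if ((unsigned) a >= unroll)
      return unroll;
    if (b > 0 || (b == 0 && a > 0))
      return unroll;
    return (unsigned) a;
  }

which, if I'm following the thread correctly, is the kind of clamping the
vectorizer isn't doing for the inner components.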
>> > >
>> > > The vectorizer does way more complicated things and only looks at the
>> > > distance with respect to the outer loop as far as I can see, which can
>> > > be negative.
>> > >
>> > > Not sure if fusion and vectorizer "interleaving" make a difference here.
>> > > I think the idea was that when interleaving stmt-by-stmt, forward
>> > > dependences would be preserved and thus we don't need to check the
>> > > inner loop dependences.  Speaking of "forward vs. backward" dependences
>> > > again, not distances...
>> > >
>> > > This also means that unroll-and-jam could be enhanced to "interleave"
>> > > stmts and thus cover more cases?
>> > >
>> > > Again, I hope Micha can have a look here...
>> >
>> > Coming back to this...  the vectorizer doesn't look at the inner loop
>> > distances at all at the moment.  Bin's original patch adds that for the
>> > case where the outer loop evolution results in a negative dependence
>> > distance.  But what if the outer loop evolution results in a zero or a
>> > positive distance within the bounds of a sensible max_vf?  So I think we
>> > need to "simply" look at all distance vector components, not just that
>> > of the outer loop, and perform all the regular checks.  That also allows
>> > adjustment of max_vf for inner loop dependences in case vectorization
>> > with smaller vectors is still possible.
>> >
>> > Doing that by making loop_depth iterate
>> > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
>> > if we make the code look at ->safelen of the inner loop when processing
>> > inner loop distances (because safelen is 0 there).  Just using ->safelen
>> > from the outer loop doesn't regress anything it seems.
>> >
>> > So I am going to test the following attached.  I have added the
>> > testcase also with reversed inner loop to verify we'd apply outer loop
>> > vectorization to that one.
>> >
>> > Any comments?
>> I kind of think that we are checking for different things in this way.
>> IIUC, the dependence checks in the outer loop vectorizer can be
>> categorized into three parts: the regular vect-related check on the
>> outer loop; the dependence check to fuse inner loop iterations; and the
>> dependence check to shuffle scalar statements/references together.  To
>> pass the fusion check, we only need to make sure no backward dependence
>> is generated in the result.  So for the case of a zero/positive distance
>> in the outer loop evolution, it won't result in a backward dependence?
>> For example, if we adjust the original loop like:
>>
>>    for (b = 4; b >= 0; b--)
>>      for (c = 0; c <= 6; c++)
>> -       a[c + 1][b + 2] = a[c][b + 1];
>> +      a[c + 1][b + 1] = a[c][b + 2];
>>
>> The fusion result would be
>>    for (b = 4; b > 0; b -= 4)
>>    {
>>      for (c = 0; c <= 6; c++)
>>      {
>>        a[c + 1][5] = a[c][6];  // S1
>>        a[c + 1][4] = a[c][5];  // S2
>>        a[c + 1][3] = a[c][4];
>>        a[c + 1][2] = a[c][3];
>>      }
>>    }
>> Though there is a dependence between S1/S2, it's a forward dependence now.
>
> Hmm, OK.  But this doesn't vectorize with or without the patch; we claim
>
> t.c:9:3: note:   dependence distance  = 1.
> t.c:11:29: missed:   not vectorized, possible dependence between
> data-refs a[c.3_15][_48] and a[_3][_50]
> t.c:9:3: missed:  bad data dependence.
> t.c:9:3: missed: couldn't vectorize loop
>
>> Even if we need to check the negative distance for the inner loop, using
>> the regular checks for the fusion part for the inner loop(s) is a bit
>> strange.  Using the outer loop's safelen here is also unclear.
>>
>> I am not sure where we do the shuffle-related check; is it the
>> (dist == 0) case?  Is that the reason the patch employs the whole set of
>> regular checks?
>
> I think the shuffling needs to look at the outer loop distance, and it is
> the == 0 and the forward dependence cases where we check/constrain
> against the vectorization factor?
>
> I agree that formulating the outer loop dependence testing in a way to
> check the unroll-and-jam condition and the shuffling part separately
> would be best.  I was mainly worried that your patch only hooks into
> the negative "outer" distance case and not the zero or positive
> distance one.  For example for unroll-and-jam we constrain the
> maximum unroll factor.  I wonder if we should simply call into
> its adjust_unroll_factor from the vectorizer or copy&paste it
> to the vectorizer.  I'll note that with known dependence distances
> it seems to never compute failure...  at least it computes the
> correct maximum vectorization factor for me.
>
> I also wonder about dependences for DRs in the outer loop
> which we'd even more heavily re-order.
>
> Anyways, we should somehow fix this.
>
> Attached is a patch variant cut&pasting the unroll-and-jam
> testing.
>
> Does this look better?
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>> >
>> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>> >
>> > Richard.
>> >
>> > 2019-03-21  Richard Biener  <rguent...@suse.de>
>> >
>> >         PR tree-optimization/81740
>> >         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
>> >         Perform distance vector related dependence checks for all
>> >         loops.
>> >
>> >         * gcc.dg/vect/pr81740-1.c: New testcase.
>> >         * gcc.dg/vect/pr81740-2.c: Likewise.
>
> 2019-03-25  Richard Biener  <rguent...@suse.de>
>
>         PR tree-optimization/81740
>       * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
>       Add explicit unroll-and-jam dependence testing.
>
>       * gcc.dg/vect/pr81740-1.c: New testcase.
>       * gcc.dg/vect/pr81740-2.c: Likewise.
>
> Index: gcc/tree-vect-data-refs.c
> ===================================================================
> --- gcc/tree-vect-data-refs.c (revision 269908)
> +++ gcc/tree-vect-data-refs.c (working copy)
> @@ -411,6 +411,45 @@ vect_analyze_data_ref_dependence (struct
>      }
>  
>    loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
> +  gcc_assert (loop_depth == 0);
> +
> +  /* Perform unroll-and-jam test.  */
> +  if (nested_in_vect_loop_p (loop, stmtinfo_a))
> +    {
> +      FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> +     {
> +       /* A distance (a,b) is at worst transformed into (a/N,b) by the
> +          unrolling (factor N), so the transformation is valid if
> +          a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
> +          factor needs to be limited so that the first condition holds.
> +          That may limit the factor down to zero in the worst case.  */

I think it would be good to say that we're applying a two-stage
check here (unroll-and-jam, then statement-level interleaving).  As it
stands, it sounds like proving (a>0 && b==0) || b>0 is enough to prove
that outer-loop vectorisation is valid for any VF, whereas that isn't
true for the interleaving part.  That said...

> +       int dist = dist_v[0];
> +       if (dist < 0)
> +         gcc_unreachable ();
> +       else if ((unsigned)dist >= *max_vf)
> +         ;
> +       else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +                || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +                    && dist > 0))
> +         ;
> +       else if (dist <= 1)
> +         return opt_result::failure_at (stmtinfo_a->stmt,
> +                                        "not vectorized, possible dependence"
> +                                        " between data-refs %T and %T\n",
> +                                        DR_REF (dra), DR_REF (drb));
> +       else
> +         {
> +           /* Dependence distance does not create dependence, as far as
> +              vectorization is concerned, in this case.  */
> +           if (dump_enabled_p ())
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "dependence between data-refs %T and %T "
> +                              "constrains vectorization factor to %d\n",
> +                              DR_REF (dra), DR_REF (drb), dist);
> +           *max_vf = dist;
> +         }
> +     }
> +    }

...I guess this is looping back to the original discussion, sorry, but I'm
not sure taking it as a two-stage check really helps.  We're going from:

A:
   for i in [0:n]:
     for j in [0:n]:
       A(i,j)
       B(i,j)

to:

B:
   for i in [0:n:2]:
     for j in [0:n]:
       A(i,j)
       B(i,j)
     for j in [0:n]:
       A(i+1,j)
       B(i+1,j)

to:

C:
   for i in [0:n:2]:
     for j in [0:n]:
       A(i,j)
       B(i,j)
       A(i+1,j)
       B(i+1,j)

to:

D:
   for i in [0:n:2]:
     for j in [0:n]:
       A(i,j)
       A(i+1,j)
       B(i,j)
       B(i+1,j)

but since outer loop vectorisation doesn't change the order of
statements for each "i", IMO it's clearer to go directly from B to D
by treating the inner loop as though we were completely unrolling it.
C doesn't seem like a useful midway point.

I think the only cases the patch is handling on top of existing checks are:

(a) a<0 && b>0, normalised to a>0 && b<0 with a reverse flag.
(b) a==0 && b==0, which we now reject earlier

I don't think (b) makes any difference in practice because
(i) we already reject accesses with a zero step and (ii)
without the support for runtime aliasing checks for outer loops,
we're not going to be able to add any extra disambiguation for
things like:

    d[i][j] = ...;
    ... = d[i][j];

relative to other loop accesses, so there won't be any cases we can
handle here that wouldn't have been propagated earlier.

So I think we're really left with (a) being the useful case,
which is also the one added by Bin's original patch.

Based on the "complete unrolling" view, if we number statements as
(i, n), where i is the outer loop iteration and n is a statement number
in the (completely unrolled) loop body, then the original scalar code
executes in lexicographical order while for the vector loop:

(1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF
(2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0
      (well, nx unrestricted, but only nx>=0 is useful given (1))

So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0
we need to restrict VF to ix so that (2) ensures the right order.
This means that the unnormalised distances of interest are:

- (ix, -nx), ix>=1, nx>=0
- (-ix, nx), ix>=1, nx>=0

But the second gets normalised to the first, which is actually useful
in this case :-).

In terms of the existing code, I think that means we want to change
the handling of nested statements (only) to:

- ignore DDR_REVERSED_P (ddr)
- restrict the main dist > 0 case to when the inner distance is <= 0.

This should have the side effect of allowing outer-loop vectorisation for:

void __attribute__ ((noipa))
f (int a[][N], int b[restrict])
{
  for (int i = N - 1; i-- > 0; )
    for (int j = 0; j < N - 1; ++j)
      a[j + 1][i] = a[j][i + 1] + b[i];
}

At the moment we reject this, but AFAICT it should be OK.
(We do allow it for s/i + 1/i/, since then the outer distance is 0.)
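(For comparison, the already-accepted variant would be the following, with
the name changed just to keep both around:)

void __attribute__ ((noipa))
g (int a[][N], int b[restrict])
{
  for (int i = N - 1; i-- > 0; )
    for (int j = 0; j < N - 1; ++j)
      a[j + 1][i] = a[j][i] + b[i];   /* outer distance 0, so OK today */
}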

I think we can also skip the existing dist==0 code for nested statements
since any case that really matters (such as zero steps) should also have
a second dependence distance (1, 0) that we'd handle as above.  Doing that
is probably something for stage 1 though.

Hope at least some of that was vaguely right. ;-)

Thanks,
Richard
