Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
Richard Biener writes: > On Tue, Mar 26, 2019 at 1:56 AM Richard Sandiford > wrote: >> Based on the "complete unrolling" view, if we number statements as >> (i, n), where i is the outer loop iteration and n is a statement number >> in the (completely unrolled) loop body, then the original scalar code >> executes in lexicographical order while for the vector loop: >> >> (1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF >> (2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0 >> (well, nx unrestricted, but only nx>=0 is useful given (1)) >> >> So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0 >> we need to restrict VF to ix so that (2) ensures the right order. >> This means that the unnormalised distances of interest are: >> >> - (ix, -nx), ix>=1, nx>=0 >> - (-ix, nx), ix>=1, nx>=0 >> >> But the second gets normalised to the first, which is actually useful >> in this case :-). >> >> In terms of the existing code, I think that means we want to change >> the handling of nested statements (only) to: >> >> - ignore DDR_REVERSED_P (ddr) >> - restrict the main dist > 0 case to when the inner distance is <= 0. >> >> This should have the side effect of allowing outer-loop vectorisation for: >> >> void __attribute__ ((noipa)) >> f (int a[][N], int b[restrict]) >> { >> for (int i = N - 1; i-- > 0; ) >> for (int j = 0; j < N - 1; ++j) >> a[j + 1][i] = a[j][i + 1] + b[i]; >> } >> >> At the moment we reject this, but AFAICT it should be OK. >> (We do allow it for s/i + 1/i/, since then the outer distance is 0.) > > Can you file an enhancement request so we don't forget? OK, for the record it's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89908
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Tue, Mar 26, 2019 at 1:56 AM Richard Sandiford wrote: > > Richard Biener writes: > > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng wrote: > >> > >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener > >> wrote: > >> > > >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > >> > wrote: > >> > > > >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng > >> > > wrote: > >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > >> > > > wrote: > >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng > >> > > >> wrote: > >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng > >> > > >>> wrote: > >> > > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > >> > > wrote: > >> > > > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng > >> > > > wrote: > >> > > >> Hi, > >> > > >> As explained in the PR, given below test case: > >> > > >> int a[8][10] = { [2][5] = 4 }, c; > >> > > >> > >> > > >> int > >> > > >> main () > >> > > >> { > >> > > >> short b; > >> > > >> int i, d; > >> > > >> for (b = 4; b >= 0; b--) > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][b + 2] = a[c][b + 1]; > >> > > >> for (i = 0; i < 8; i++) > >> > > >> for (d = 0; d < 10; d++) > >> > > >> if (a[i][d] != (i == 3 && d == 6) * 4) > >> > > >> __builtin_abort (); > >> > > >> return 0; > >> > > >> > >> > > >> the loop nest is illegal for vectorization without reversing > >> > > >> inner loop. The issue > >> > > >> is in data dependence checking of vectorizer, I believe the > >> > > >> mentioned revision just > >> > > >> exposed this. Previously the vectorization is skipped because > >> > > >> of unsupported memory > >> > > >> operation. The outer loop vectorization unrolls the outer loop > >> > > >> into: > >> > > >> > >> > > >> for (b = 4; b > 0; b -= 4) > >> > > >> { > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][6] = a[c][5]; > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][5] = a[c][4]; > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][4] = a[c][3]; > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][3] = a[c][2]; > >> > > >> } > >> > > >> Then four inner loops are fused into: > >> > > >> for (b = 4; b > 0; b -= 4) > >> > > >> { > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> { > >> > > >> a[c + 1][6] = a[c][5]; // S1 > >> > > >> a[c + 1][5] = a[c][4]; // S2 > >> > > >> a[c + 1][4] = a[c][3]; > >> > > >> a[c + 1][3] = a[c][2]; > >> > > >> } > >> > > >> } > >> > > > > >> > > > Note that they are not really "fused" but they are interleaved. > >> > > > With > >> > > > GIMPLE in mind > >> > > > that makes a difference, you should get the equivalent of > >> > > > > >> > > >for (c = 0; c <= 6; c++) > >> > > > { > >> > > >tem1 = a[c][5]; > >> > > >tem2 = a[c][4]; > >> > > >tem3 = a[c][3]; > >> > > >tem4 = a[c][2]; > >> > > >a[c+1][6] = tem1; > >> > > >a[c +1][5] = tem2; > >> > > > a[c+1][4] = tem3; > >> > > > a[c+1][3] = tem4; > >> > > > } > >> > > Yeah, I will double check if this abstract breaks the patch and > >> > > how. > >> > > >>> Hmm, I think this doesn't break it, well at least for part of the > >> > > >>> analysis, because it is loop carried (backward) dependence goes > >> > > >>> wrong, > >> > > >>> interleaving or not with the same iteration doesn't matter here. > >> > > >> > >> > > >> I think the idea is that forward dependences are always fine > >> > > >> (negative distance) > >> > > >> to vectorize. But with backward dependences we have to adhere to > >> > > >> max_vf. > >> > > >> > >> > > >> It looks like for outer loop vectorization we only look at the > >> > > >> distances in the > >> > > >> outer loop but never at inner ones. But here the same applies but > >> > > >> isn't that > >> > > >> independend on the distances with respect to the outer loop? > >> > > >> > >> > > >> But maybe I'm misunderstanding how "distances" work here. > >> > > > Hmm, I am not sure I understand "distance" correctly. With > >> > > > description as in book like "Optimizing compilers for Modern > >> > > > Architectures", distance is "# of iteration of sink ref - # of > >> > > > iteration of source ref". Given below example: > >> > > > for (i = 0; i < N; ++i) > >> > > > { > >> > > > x = arr[idx_1]; // S1 > >> > > > arr[idx_2] = x; // S2 > >> > > > } > >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. > >> > > > Also > >> > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > >> > > > i; > >> > > > If S2 is source ref, distance = idx_1 -
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Tue, Mar 26, 2019 at 8:56 AM Richard Sandiford wrote: > > Richard Biener writes: > > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng wrote: > >> > >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener > >> wrote: > >> > > >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > >> > wrote: > >> > > > >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng > >> > > wrote: > >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > >> > > > wrote: > >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng > >> > > >> wrote: > >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng > >> > > >>> wrote: > >> > > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > >> > > wrote: > >> > > > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng > >> > > > wrote: > >> > > >> Hi, > >> > > >> As explained in the PR, given below test case: > >> > > >> int a[8][10] = { [2][5] = 4 }, c; > >> > > >> > >> > > >> int > >> > > >> main () > >> > > >> { > >> > > >> short b; > >> > > >> int i, d; > >> > > >> for (b = 4; b >= 0; b--) > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][b + 2] = a[c][b + 1]; > >> > > >> for (i = 0; i < 8; i++) > >> > > >> for (d = 0; d < 10; d++) > >> > > >> if (a[i][d] != (i == 3 && d == 6) * 4) > >> > > >> __builtin_abort (); > >> > > >> return 0; > >> > > >> > >> > > >> the loop nest is illegal for vectorization without reversing > >> > > >> inner loop. The issue > >> > > >> is in data dependence checking of vectorizer, I believe the > >> > > >> mentioned revision just > >> > > >> exposed this. Previously the vectorization is skipped because > >> > > >> of unsupported memory > >> > > >> operation. The outer loop vectorization unrolls the outer loop > >> > > >> into: > >> > > >> > >> > > >> for (b = 4; b > 0; b -= 4) > >> > > >> { > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][6] = a[c][5]; > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][5] = a[c][4]; > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][4] = a[c][3]; > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> a[c + 1][3] = a[c][2]; > >> > > >> } > >> > > >> Then four inner loops are fused into: > >> > > >> for (b = 4; b > 0; b -= 4) > >> > > >> { > >> > > >> for (c = 0; c <= 6; c++) > >> > > >> { > >> > > >> a[c + 1][6] = a[c][5]; // S1 > >> > > >> a[c + 1][5] = a[c][4]; // S2 > >> > > >> a[c + 1][4] = a[c][3]; > >> > > >> a[c + 1][3] = a[c][2]; > >> > > >> } > >> > > >> } > >> > > > > >> > > > Note that they are not really "fused" but they are interleaved. > >> > > > With > >> > > > GIMPLE in mind > >> > > > that makes a difference, you should get the equivalent of > >> > > > > >> > > >for (c = 0; c <= 6; c++) > >> > > > { > >> > > >tem1 = a[c][5]; > >> > > >tem2 = a[c][4]; > >> > > >tem3 = a[c][3]; > >> > > >tem4 = a[c][2]; > >> > > >a[c+1][6] = tem1; > >> > > >a[c +1][5] = tem2; > >> > > > a[c+1][4] = tem3; > >> > > > a[c+1][3] = tem4; > >> > > > } > >> > > Yeah, I will double check if this abstract breaks the patch and > >> > > how. > >> > > >>> Hmm, I think this doesn't break it, well at least for part of the > >> > > >>> analysis, because it is loop carried (backward) dependence goes > >> > > >>> wrong, > >> > > >>> interleaving or not with the same iteration doesn't matter here. > >> > > >> > >> > > >> I think the idea is that forward dependences are always fine > >> > > >> (negative distance) > >> > > >> to vectorize. But with backward dependences we have to adhere to > >> > > >> max_vf. > >> > > >> > >> > > >> It looks like for outer loop vectorization we only look at the > >> > > >> distances in the > >> > > >> outer loop but never at inner ones. But here the same applies but > >> > > >> isn't that > >> > > >> independend on the distances with respect to the outer loop? > >> > > >> > >> > > >> But maybe I'm misunderstanding how "distances" work here. > >> > > > Hmm, I am not sure I understand "distance" correctly. With > >> > > > description as in book like "Optimizing compilers for Modern > >> > > > Architectures", distance is "# of iteration of sink ref - # of > >> > > > iteration of source ref". Given below example: > >> > > > for (i = 0; i < N; ++i) > >> > > > { > >> > > > x = arr[idx_1]; // S1 > >> > > > arr[idx_2] = x; // S2 > >> > > > } > >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. > >> > > > Also > >> > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > >> > > > i; > >> > > > If S2 is source ref, distance = idx_1 -
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
Richard Biener writes: > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng wrote: >> >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener >> wrote: >> > >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener >> > wrote: >> > > >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng wrote: >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener >> > > > wrote: >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng >> > > >> wrote: >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng >> > > >>> wrote: >> > > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener >> > > wrote: >> > > > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng >> > > > wrote: >> > > >> Hi, >> > > >> As explained in the PR, given below test case: >> > > >> int a[8][10] = { [2][5] = 4 }, c; >> > > >> >> > > >> int >> > > >> main () >> > > >> { >> > > >> short b; >> > > >> int i, d; >> > > >> for (b = 4; b >= 0; b--) >> > > >> for (c = 0; c <= 6; c++) >> > > >> a[c + 1][b + 2] = a[c][b + 1]; >> > > >> for (i = 0; i < 8; i++) >> > > >> for (d = 0; d < 10; d++) >> > > >> if (a[i][d] != (i == 3 && d == 6) * 4) >> > > >> __builtin_abort (); >> > > >> return 0; >> > > >> >> > > >> the loop nest is illegal for vectorization without reversing >> > > >> inner loop. The issue >> > > >> is in data dependence checking of vectorizer, I believe the >> > > >> mentioned revision just >> > > >> exposed this. Previously the vectorization is skipped because of >> > > >> unsupported memory >> > > >> operation. The outer loop vectorization unrolls the outer loop >> > > >> into: >> > > >> >> > > >> for (b = 4; b > 0; b -= 4) >> > > >> { >> > > >> for (c = 0; c <= 6; c++) >> > > >> a[c + 1][6] = a[c][5]; >> > > >> for (c = 0; c <= 6; c++) >> > > >> a[c + 1][5] = a[c][4]; >> > > >> for (c = 0; c <= 6; c++) >> > > >> a[c + 1][4] = a[c][3]; >> > > >> for (c = 0; c <= 6; c++) >> > > >> a[c + 1][3] = a[c][2]; >> > > >> } >> > > >> Then four inner loops are fused into: >> > > >> for (b = 4; b > 0; b -= 4) >> > > >> { >> > > >> for (c = 0; c <= 6; c++) >> > > >> { >> > > >> a[c + 1][6] = a[c][5]; // S1 >> > > >> a[c + 1][5] = a[c][4]; // S2 >> > > >> a[c + 1][4] = a[c][3]; >> > > >> a[c + 1][3] = a[c][2]; >> > > >> } >> > > >> } >> > > > >> > > > Note that they are not really "fused" but they are interleaved. >> > > > With >> > > > GIMPLE in mind >> > > > that makes a difference, you should get the equivalent of >> > > > >> > > >for (c = 0; c <= 6; c++) >> > > > { >> > > >tem1 = a[c][5]; >> > > >tem2 = a[c][4]; >> > > >tem3 = a[c][3]; >> > > >tem4 = a[c][2]; >> > > >a[c+1][6] = tem1; >> > > >a[c +1][5] = tem2; >> > > > a[c+1][4] = tem3; >> > > > a[c+1][3] = tem4; >> > > > } >> > > Yeah, I will double check if this abstract breaks the patch and how. >> > > >>> Hmm, I think this doesn't break it, well at least for part of the >> > > >>> analysis, because it is loop carried (backward) dependence goes >> > > >>> wrong, >> > > >>> interleaving or not with the same iteration doesn't matter here. >> > > >> >> > > >> I think the idea is that forward dependences are always fine >> > > >> (negative distance) >> > > >> to vectorize. But with backward dependences we have to adhere to >> > > >> max_vf. >> > > >> >> > > >> It looks like for outer loop vectorization we only look at the >> > > >> distances in the >> > > >> outer loop but never at inner ones. But here the same applies but >> > > >> isn't that >> > > >> independend on the distances with respect to the outer loop? >> > > >> >> > > >> But maybe I'm misunderstanding how "distances" work here. >> > > > Hmm, I am not sure I understand "distance" correctly. With >> > > > description as in book like "Optimizing compilers for Modern >> > > > Architectures", distance is "# of iteration of sink ref - # of >> > > > iteration of source ref". Given below example: >> > > > for (i = 0; i < N; ++i) >> > > > { >> > > > x = arr[idx_1]; // S1 >> > > > arr[idx_2] = x; // S2 >> > > > } >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also >> > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is >> > > > i; >> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also >> > > > this is backward dependence. For example idx_1 is i and idx_2 is i + >> > > > 1; >> > > > >> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing >> > > > classic distance, we could result in negative distance in case of >> > > > backward dependence. When this happens
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng wrote: > > On Thu, Mar 21, 2019 at 8:24 PM Richard Biener > wrote: > > > > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > > wrote: > > > > > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng wrote: > > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > > > > wrote: > > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng > > > >> wrote: > > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng > > > >>> wrote: > > > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > > > wrote: > > > > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng > > > > wrote: > > > >> Hi, > > > >> As explained in the PR, given below test case: > > > >> int a[8][10] = { [2][5] = 4 }, c; > > > >> > > > >> int > > > >> main () > > > >> { > > > >> short b; > > > >> int i, d; > > > >> for (b = 4; b >= 0; b--) > > > >> for (c = 0; c <= 6; c++) > > > >> a[c + 1][b + 2] = a[c][b + 1]; > > > >> for (i = 0; i < 8; i++) > > > >> for (d = 0; d < 10; d++) > > > >> if (a[i][d] != (i == 3 && d == 6) * 4) > > > >> __builtin_abort (); > > > >> return 0; > > > >> > > > >> the loop nest is illegal for vectorization without reversing inner > > > >> loop. The issue > > > >> is in data dependence checking of vectorizer, I believe the > > > >> mentioned revision just > > > >> exposed this. Previously the vectorization is skipped because of > > > >> unsupported memory > > > >> operation. The outer loop vectorization unrolls the outer loop > > > >> into: > > > >> > > > >> for (b = 4; b > 0; b -= 4) > > > >> { > > > >> for (c = 0; c <= 6; c++) > > > >> a[c + 1][6] = a[c][5]; > > > >> for (c = 0; c <= 6; c++) > > > >> a[c + 1][5] = a[c][4]; > > > >> for (c = 0; c <= 6; c++) > > > >> a[c + 1][4] = a[c][3]; > > > >> for (c = 0; c <= 6; c++) > > > >> a[c + 1][3] = a[c][2]; > > > >> } > > > >> Then four inner loops are fused into: > > > >> for (b = 4; b > 0; b -= 4) > > > >> { > > > >> for (c = 0; c <= 6; c++) > > > >> { > > > >> a[c + 1][6] = a[c][5]; // S1 > > > >> a[c + 1][5] = a[c][4]; // S2 > > > >> a[c + 1][4] = a[c][3]; > > > >> a[c + 1][3] = a[c][2]; > > > >> } > > > >> } > > > > > > > > Note that they are not really "fused" but they are interleaved. > > > > With > > > > GIMPLE in mind > > > > that makes a difference, you should get the equivalent of > > > > > > > >for (c = 0; c <= 6; c++) > > > > { > > > >tem1 = a[c][5]; > > > >tem2 = a[c][4]; > > > >tem3 = a[c][3]; > > > >tem4 = a[c][2]; > > > >a[c+1][6] = tem1; > > > >a[c +1][5] = tem2; > > > > a[c+1][4] = tem3; > > > > a[c+1][3] = tem4; > > > > } > > > Yeah, I will double check if this abstract breaks the patch and how. > > > >>> Hmm, I think this doesn't break it, well at least for part of the > > > >>> analysis, because it is loop carried (backward) dependence goes wrong, > > > >>> interleaving or not with the same iteration doesn't matter here. > > > >> > > > >> I think the idea is that forward dependences are always fine (negative > > > >> distance) > > > >> to vectorize. But with backward dependences we have to adhere to > > > >> max_vf. > > > >> > > > >> It looks like for outer loop vectorization we only look at the > > > >> distances in the > > > >> outer loop but never at inner ones. But here the same applies but > > > >> isn't that > > > >> independend on the distances with respect to the outer loop? > > > >> > > > >> But maybe I'm misunderstanding how "distances" work here. > > > > Hmm, I am not sure I understand "distance" correctly. With > > > > description as in book like "Optimizing compilers for Modern > > > > Architectures", distance is "# of iteration of sink ref - # of > > > > iteration of source ref". Given below example: > > > > for (i = 0; i < N; ++i) > > > > { > > > > x = arr[idx_1]; // S1 > > > > arr[idx_2] = x; // S2 > > > > } > > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also > > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > > > > i; > > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also > > > > this is backward dependence. For example idx_1 is i and idx_2 is i + > > > > 1; > > > > > > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > > > > classic distance, we could result in negative distance in case of > > > > backward dependence. When this happens at dependence carried level, > > > > we manually reverse it. When this happens at inner level loop, we > > > > simply keep the negative distance. And it's
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Thu, Mar 21, 2019 at 8:24 PM Richard Biener wrote: > > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > wrote: > > > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng wrote: > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > > > wrote: > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng wrote: > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng > > >>> wrote: > > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > > wrote: > > > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: > > >> Hi, > > >> As explained in the PR, given below test case: > > >> int a[8][10] = { [2][5] = 4 }, c; > > >> > > >> int > > >> main () > > >> { > > >> short b; > > >> int i, d; > > >> for (b = 4; b >= 0; b--) > > >> for (c = 0; c <= 6; c++) > > >> a[c + 1][b + 2] = a[c][b + 1]; > > >> for (i = 0; i < 8; i++) > > >> for (d = 0; d < 10; d++) > > >> if (a[i][d] != (i == 3 && d == 6) * 4) > > >> __builtin_abort (); > > >> return 0; > > >> > > >> the loop nest is illegal for vectorization without reversing inner > > >> loop. The issue > > >> is in data dependence checking of vectorizer, I believe the > > >> mentioned revision just > > >> exposed this. Previously the vectorization is skipped because of > > >> unsupported memory > > >> operation. The outer loop vectorization unrolls the outer loop into: > > >> > > >> for (b = 4; b > 0; b -= 4) > > >> { > > >> for (c = 0; c <= 6; c++) > > >> a[c + 1][6] = a[c][5]; > > >> for (c = 0; c <= 6; c++) > > >> a[c + 1][5] = a[c][4]; > > >> for (c = 0; c <= 6; c++) > > >> a[c + 1][4] = a[c][3]; > > >> for (c = 0; c <= 6; c++) > > >> a[c + 1][3] = a[c][2]; > > >> } > > >> Then four inner loops are fused into: > > >> for (b = 4; b > 0; b -= 4) > > >> { > > >> for (c = 0; c <= 6; c++) > > >> { > > >> a[c + 1][6] = a[c][5]; // S1 > > >> a[c + 1][5] = a[c][4]; // S2 > > >> a[c + 1][4] = a[c][3]; > > >> a[c + 1][3] = a[c][2]; > > >> } > > >> } > > > > > > Note that they are not really "fused" but they are interleaved. With > > > GIMPLE in mind > > > that makes a difference, you should get the equivalent of > > > > > >for (c = 0; c <= 6; c++) > > > { > > >tem1 = a[c][5]; > > >tem2 = a[c][4]; > > >tem3 = a[c][3]; > > >tem4 = a[c][2]; > > >a[c+1][6] = tem1; > > >a[c +1][5] = tem2; > > > a[c+1][4] = tem3; > > > a[c+1][3] = tem4; > > > } > > Yeah, I will double check if this abstract breaks the patch and how. > > >>> Hmm, I think this doesn't break it, well at least for part of the > > >>> analysis, because it is loop carried (backward) dependence goes wrong, > > >>> interleaving or not with the same iteration doesn't matter here. > > >> > > >> I think the idea is that forward dependences are always fine (negative > > >> distance) > > >> to vectorize. But with backward dependences we have to adhere to max_vf. > > >> > > >> It looks like for outer loop vectorization we only look at the distances > > >> in the > > >> outer loop but never at inner ones. But here the same applies but isn't > > >> that > > >> independend on the distances with respect to the outer loop? > > >> > > >> But maybe I'm misunderstanding how "distances" work here. > > > Hmm, I am not sure I understand "distance" correctly. With > > > description as in book like "Optimizing compilers for Modern > > > Architectures", distance is "# of iteration of sink ref - # of > > > iteration of source ref". Given below example: > > > for (i = 0; i < N; ++i) > > > { > > > x = arr[idx_1]; // S1 > > > arr[idx_2] = x; // S2 > > > } > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > > > i; > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also > > > this is backward dependence. For example idx_1 is i and idx_2 is i + > > > 1; > > > > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > > > classic distance, we could result in negative distance in case of > > > backward dependence. When this happens at dependence carried level, > > > we manually reverse it. When this happens at inner level loop, we > > > simply keep the negative distance. And it's meaningless to talk about > > > forward/backward given dependence is carried at outer level loop. > > > > > > Outer loop vectorization is interesting. The problematic case has > > > backward dependence carried by outer loop. Because we don't check > > > dist vs. max_vf for such dep, the unrolled references could have outer > > > loop index
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Mon, Dec 18, 2017 at 1:37 PM Richard Biener wrote: > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng wrote: > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > > wrote: > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng wrote: > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng wrote: > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > wrote: > > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: > >> Hi, > >> As explained in the PR, given below test case: > >> int a[8][10] = { [2][5] = 4 }, c; > >> > >> int > >> main () > >> { > >> short b; > >> int i, d; > >> for (b = 4; b >= 0; b--) > >> for (c = 0; c <= 6; c++) > >> a[c + 1][b + 2] = a[c][b + 1]; > >> for (i = 0; i < 8; i++) > >> for (d = 0; d < 10; d++) > >> if (a[i][d] != (i == 3 && d == 6) * 4) > >> __builtin_abort (); > >> return 0; > >> > >> the loop nest is illegal for vectorization without reversing inner > >> loop. The issue > >> is in data dependence checking of vectorizer, I believe the mentioned > >> revision just > >> exposed this. Previously the vectorization is skipped because of > >> unsupported memory > >> operation. The outer loop vectorization unrolls the outer loop into: > >> > >> for (b = 4; b > 0; b -= 4) > >> { > >> for (c = 0; c <= 6; c++) > >> a[c + 1][6] = a[c][5]; > >> for (c = 0; c <= 6; c++) > >> a[c + 1][5] = a[c][4]; > >> for (c = 0; c <= 6; c++) > >> a[c + 1][4] = a[c][3]; > >> for (c = 0; c <= 6; c++) > >> a[c + 1][3] = a[c][2]; > >> } > >> Then four inner loops are fused into: > >> for (b = 4; b > 0; b -= 4) > >> { > >> for (c = 0; c <= 6; c++) > >> { > >> a[c + 1][6] = a[c][5]; // S1 > >> a[c + 1][5] = a[c][4]; // S2 > >> a[c + 1][4] = a[c][3]; > >> a[c + 1][3] = a[c][2]; > >> } > >> } > > > > Note that they are not really "fused" but they are interleaved. With > > GIMPLE in mind > > that makes a difference, you should get the equivalent of > > > >for (c = 0; c <= 6; c++) > > { > >tem1 = a[c][5]; > >tem2 = a[c][4]; > >tem3 = a[c][3]; > >tem4 = a[c][2]; > >a[c+1][6] = tem1; > >a[c +1][5] = tem2; > > a[c+1][4] = tem3; > > a[c+1][3] = tem4; > > } > Yeah, I will double check if this abstract breaks the patch and how. > >>> Hmm, I think this doesn't break it, well at least for part of the > >>> analysis, because it is loop carried (backward) dependence goes wrong, > >>> interleaving or not with the same iteration doesn't matter here. > >> > >> I think the idea is that forward dependences are always fine (negative > >> distance) > >> to vectorize. But with backward dependences we have to adhere to max_vf. > >> > >> It looks like for outer loop vectorization we only look at the distances > >> in the > >> outer loop but never at inner ones. But here the same applies but isn't > >> that > >> independend on the distances with respect to the outer loop? > >> > >> But maybe I'm misunderstanding how "distances" work here. > > Hmm, I am not sure I understand "distance" correctly. With > > description as in book like "Optimizing compilers for Modern > > Architectures", distance is "# of iteration of sink ref - # of > > iteration of source ref". Given below example: > > for (i = 0; i < N; ++i) > > { > > x = arr[idx_1]; // S1 > > arr[idx_2] = x; // S2 > > } > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > > i; > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also > > this is backward dependence. For example idx_1 is i and idx_2 is i + > > 1; > > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > > classic distance, we could result in negative distance in case of > > backward dependence. When this happens at dependence carried level, > > we manually reverse it. When this happens at inner level loop, we > > simply keep the negative distance. And it's meaningless to talk about > > forward/backward given dependence is carried at outer level loop. > > > > Outer loop vectorization is interesting. The problematic case has > > backward dependence carried by outer loop. Because we don't check > > dist vs. max_vf for such dep, the unrolled references could have outer > > loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's > > like we have outer loop index resolved as equal. Now it has > > dependence only if c == c' + 1. I think previous comment on fusion > > still hold for this and we now need to check backward dependence > > between the two reference at inner
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Tue, Dec 19, 2017 at 12:56 PM, Richard Bienerwrote: > On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng wrote: >> On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz wrote: >>> Hi, >>> >>> On Mon, 18 Dec 2017, Richard Biener wrote: >>> where *unroll is similar to *max_vf I think. dist_v[0] is the innermost loop. >>> >>> [0] is always outermost loop. >>> The vectorizer does way more complicated things and only looks at the distance with respect to the outer loop as far as I can see which can be negative. Not sure if fusion and vectorizer "interleaving" makes a difference here. I think the idea was that when interleaving stmt-by-stmt then forward dependences would be preserved and thus we don't need to check the inner loop dependences. speaking with "forward vs. backward" dependences again, not distances... This also means that unroll-and-jam could be enhanced to "interleave" stmts and thus cover more cases? Again, I hope Micha can have a look here... >>> >>> Haven't yet looked at the patch, but some comments anyway: >>> >>> fusion and interleaving interact in the following way in outer loop >>> vectorization, conceptually: >>> * (1) the outer loop is unrolled >>> * (2) the inner loops are fused >>> * (3) the (now single) inner body is rescheduled/shuffled/interleaved. >>> >> Thanks Michael for explaining issue clearer, this is what I meant. As >> for PR60276, I think it's actually the other side of the problem, >> which only relates to dependence validity of interleaving. > > Interleaving validity is what is checked by the current code, I don't > see any checking for validity of (2). Now, the current code only > looks at the outer loop distances to verify interleaving validity. > > I think we need to verify whether fusion is valid, preferably clearly > separated from the current code checking interleaving validity. > > I'm not 100% convinced the interleaving validity check is correct for > the outer loop vectorization case. > > I think it helps to reduce the dependence checking code to what we do > for unroll-and-jam: > > Index: gcc/tree-vect-data-refs.c > === > --- gcc/tree-vect-data-refs.c (revision 255777) > +++ gcc/tree-vect-data-refs.c (working copy) > @@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct > dump_printf_loc (MSG_NOTE, vect_location, > "dependence distance = %d.\n", dist); > > - if (dist == 0) > + if (dist < 0) > + gcc_unreachable (); > + > + else if (dist >= *max_vf) > + { > + /* Dependence distance does not create dependence, as far as > +vectorization is concerned, in this case. */ > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > +"dependence distance >= VF.\n"); > + continue; > + } > + > + else if (DDR_NB_LOOPS (ddr) == 2 > + && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > + && dist > 0))) > + continue; > + > + else if (dist == 0) > { > if (dump_enabled_p ()) > { > @@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct > continue; > } > > - if (dist > 0 && DDR_REVERSED_P (ddr)) > - { > - /* If DDR_REVERSED_P the order of the data-refs in DDR was > -reversed (to make distance vector positive), and the actual > -distance is negative. */ > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > -"dependence distance negative.\n"); > - /* Record a negative dependence distance to later limit the > -amount of stmt copying / unrolling we can perform. > -Only need to handle read-after-write dependence. */ > - if (DR_IS_READ (drb) > - && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0 > - || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist)) > - STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist; > - continue; > - } Simply removing DDR_REVERSED_P doesn't work, as below failures indicating. > - > - if (abs (dist) >= 2 > - && abs (dist) < *max_vf) > + if (dist >= 2) > { > /* The dependence distance requires reduction of the maximal > vectorization factor. */ > @@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct > dump_printf_loc (MSG_NOTE, vect_location, > "adjusting maximal vectorization factor to %i\n", > *max_vf); > - } > - > - if (abs (dist) >= *max_vf) >
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Tue, Dec 19, 2017 at 12:58 PM, Bin.Chengwrote: > On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz wrote: >> Hi, >> >> On Mon, 18 Dec 2017, Richard Biener wrote: >> >>> where *unroll is similar to *max_vf I think. dist_v[0] is the innermost >>> loop. >> >> [0] is always outermost loop. >> >>> The vectorizer does way more complicated things and only looks at the >>> distance with respect to the outer loop as far as I can see which can be >>> negative. >>> >>> Not sure if fusion and vectorizer "interleaving" makes a difference >>> here. I think the idea was that when interleaving stmt-by-stmt then >>> forward dependences would be preserved and thus we don't need to check >>> the inner loop dependences. speaking with "forward vs. backward" >>> dependences again, not distances... >>> >>> This also means that unroll-and-jam could be enhanced to "interleave" >>> stmts and thus cover more cases? >>> >>> Again, I hope Micha can have a look here... >> >> Haven't yet looked at the patch, but some comments anyway: >> >> fusion and interleaving interact in the following way in outer loop >> vectorization, conceptually: >> * (1) the outer loop is unrolled >> * (2) the inner loops are fused >> * (3) the (now single) inner body is rescheduled/shuffled/interleaved. >> > Thanks Michael for explaining issue clearer, this is what I meant. As > for PR60276, I think it's actually the other side of the problem, > which only relates to dependence validity of interleaving. Interleaving validity is what is checked by the current code, I don't see any checking for validity of (2). Now, the current code only looks at the outer loop distances to verify interleaving validity. I think we need to verify whether fusion is valid, preferably clearly separated from the current code checking interleaving validity. I'm not 100% convinced the interleaving validity check is correct for the outer loop vectorization case. I think it helps to reduce the dependence checking code to what we do for unroll-and-jam: Index: gcc/tree-vect-data-refs.c === --- gcc/tree-vect-data-refs.c (revision 255777) +++ gcc/tree-vect-data-refs.c (working copy) @@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct dump_printf_loc (MSG_NOTE, vect_location, "dependence distance = %d.\n", dist); - if (dist == 0) + if (dist < 0) + gcc_unreachable (); + + else if (dist >= *max_vf) + { + /* Dependence distance does not create dependence, as far as +vectorization is concerned, in this case. */ + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, +"dependence distance >= VF.\n"); + continue; + } + + else if (DDR_NB_LOOPS (ddr) == 2 + && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1) + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1) + && dist > 0))) + continue; + + else if (dist == 0) { if (dump_enabled_p ()) { @@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct continue; } - if (dist > 0 && DDR_REVERSED_P (ddr)) - { - /* If DDR_REVERSED_P the order of the data-refs in DDR was -reversed (to make distance vector positive), and the actual -distance is negative. */ - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, -"dependence distance negative.\n"); - /* Record a negative dependence distance to later limit the -amount of stmt copying / unrolling we can perform. -Only need to handle read-after-write dependence. */ - if (DR_IS_READ (drb) - && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0 - || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist)) - STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist; - continue; - } - - if (abs (dist) >= 2 - && abs (dist) < *max_vf) + if (dist >= 2) { /* The dependence distance requires reduction of the maximal vectorization factor. */ @@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct dump_printf_loc (MSG_NOTE, vect_location, "adjusting maximal vectorization factor to %i\n", *max_vf); - } - - if (abs (dist) >= *max_vf) - { - /* Dependence distance does not create dependence, as far as -vectorization is concerned, in this case. */ - if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, -"dependence distance >= VF.\n"); continue; } that fixes the testcase (probably by
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Mon, Dec 18, 2017 at 2:35 PM, Michael Matzwrote: > Hi, > > On Mon, 18 Dec 2017, Richard Biener wrote: > >> where *unroll is similar to *max_vf I think. dist_v[0] is the innermost >> loop. > > [0] is always outermost loop. > >> The vectorizer does way more complicated things and only looks at the >> distance with respect to the outer loop as far as I can see which can be >> negative. >> >> Not sure if fusion and vectorizer "interleaving" makes a difference >> here. I think the idea was that when interleaving stmt-by-stmt then >> forward dependences would be preserved and thus we don't need to check >> the inner loop dependences. speaking with "forward vs. backward" >> dependences again, not distances... >> >> This also means that unroll-and-jam could be enhanced to "interleave" >> stmts and thus cover more cases? >> >> Again, I hope Micha can have a look here... > > Haven't yet looked at the patch, but some comments anyway: > > fusion and interleaving interact in the following way in outer loop > vectorization, conceptually: > * (1) the outer loop is unrolled > * (2) the inner loops are fused > * (3) the (now single) inner body is rescheduled/shuffled/interleaved. > Thanks Michael for explaining issue clearer, this is what I meant. As for PR60276, I think it's actually the other side of the problem, which only relates to dependence validity of interleaving. Thanks, bin > (1) is always okay. But (2) and (3) as individual transformations must > both be checked for validity. If fusion isn't possible the whole > transformation is invalid, and if interleaving isn't possible the same is > true. In the specific example: > > for (b = 4; b >= 0; b--) > for (c = 0; c <= 6; c++) > t = a[c][b + 1]; // S1 > a[c + 1][b + 2] = t; // S2 > > it's already the fusion step that's invalid. There's a > dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0) > with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5]. So a > write-after-read. After fusing: > >for (c = 0; c <= 6; c++) > { >t = a[c][5]; // S1 >a[c + 1][6] = t; >t = a[c][4]; >a[c + 1][5] = t; // S2 >a[c + 1][4] = a[c][3]; >a[c + 1][3] = a[c][2]; > } > > here we have at iterations (c) = (0) comes-before (1), at S2(0) writing > a[1][5] and S1(1) writing a[1][5]. I.e. now it's a read-after-write (the > write in iteration 0 overwrites the value that is going to be read at > iteration 1, which wasn't the case in the original loop). The dependence > switched direction --> invalid. > > The simple interleaving of statements can't rectify this. > Interleaving is an inner-body reordering but the brokenness comes from a > cross-iteration ordering. > > This example can be unroll-jammed or outer-loop vectorized if one of the > two loops is reversed. Let's say we reverse the inner loop, so that it > runs in the same direction as the outer loop (reversal is possible here). > > It'd then be something like: > >for (c = 6; c >= 0; c--) > { >t = a[c][5]; // S1 >a[c + 1][6] = t; >t = a[c][4]; >a[c + 1][5] = t; // S2 >a[c + 1][4] = a[c][3]; >a[c + 1][3] = a[c][2]; > } > > The dependence between S1/S2 would still be a write-after-read, and all > would be well. This reversal of the inner loop can partly be simulated by > not only interleaving the inner insns, but by also _reodering_ them. But > AFAIK the vectorizer doesn't do this? > > > Ciao, > Michael.
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
Hi, On Mon, 18 Dec 2017, Richard Biener wrote: > where *unroll is similar to *max_vf I think. dist_v[0] is the innermost loop. [0] is always outermost loop. > The vectorizer does way more complicated things and only looks at the > distance with respect to the outer loop as far as I can see which can be > negative. > > Not sure if fusion and vectorizer "interleaving" makes a difference > here. I think the idea was that when interleaving stmt-by-stmt then > forward dependences would be preserved and thus we don't need to check > the inner loop dependences. speaking with "forward vs. backward" > dependences again, not distances... > > This also means that unroll-and-jam could be enhanced to "interleave" > stmts and thus cover more cases? > > Again, I hope Micha can have a look here... Haven't yet looked at the patch, but some comments anyway: fusion and interleaving interact in the following way in outer loop vectorization, conceptually: * (1) the outer loop is unrolled * (2) the inner loops are fused * (3) the (now single) inner body is rescheduled/shuffled/interleaved. (1) is always okay. But (2) and (3) as individual transformations must both be checked for validity. If fusion isn't possible the whole transformation is invalid, and if interleaving isn't possible the same is true. In the specific example: for (b = 4; b >= 0; b--) for (c = 0; c <= 6; c++) t = a[c][b + 1]; // S1 a[c + 1][b + 2] = t; // S2 it's already the fusion step that's invalid. There's a dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0) with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5]. So a write-after-read. After fusing: for (c = 0; c <= 6; c++) { t = a[c][5]; // S1 a[c + 1][6] = t; t = a[c][4]; a[c + 1][5] = t; // S2 a[c + 1][4] = a[c][3]; a[c + 1][3] = a[c][2]; } here we have at iterations (c) = (0) comes-before (1), at S2(0) writing a[1][5] and S1(1) writing a[1][5]. I.e. now it's a read-after-write (the write in iteration 0 overwrites the value that is going to be read at iteration 1, which wasn't the case in the original loop). The dependence switched direction --> invalid. The simple interleaving of statements can't rectify this. Interleaving is an inner-body reordering but the brokenness comes from a cross-iteration ordering. This example can be unroll-jammed or outer-loop vectorized if one of the two loops is reversed. Let's say we reverse the inner loop, so that it runs in the same direction as the outer loop (reversal is possible here). It'd then be something like: for (c = 6; c >= 0; c--) { t = a[c][5]; // S1 a[c + 1][6] = t; t = a[c][4]; a[c + 1][5] = t; // S2 a[c + 1][4] = a[c][3]; a[c + 1][3] = a[c][2]; } The dependence between S1/S2 would still be a write-after-read, and all would be well. This reversal of the inner loop can partly be simulated by not only interleaving the inner insns, but by also _reodering_ them. But AFAIK the vectorizer doesn't do this? Ciao, Michael.
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Mon, Dec 18, 2017 at 1:37 PM, Richard Bienerwrote: > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng wrote: >> On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener >> wrote: >>> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng wrote: On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng wrote: > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > wrote: >> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: >>> Hi, >>> As explained in the PR, given below test case: >>> int a[8][10] = { [2][5] = 4 }, c; >>> >>> int >>> main () >>> { >>> short b; >>> int i, d; >>> for (b = 4; b >= 0; b--) >>> for (c = 0; c <= 6; c++) >>> a[c + 1][b + 2] = a[c][b + 1]; >>> for (i = 0; i < 8; i++) >>> for (d = 0; d < 10; d++) >>> if (a[i][d] != (i == 3 && d == 6) * 4) >>> __builtin_abort (); >>> return 0; >>> >>> the loop nest is illegal for vectorization without reversing inner >>> loop. The issue >>> is in data dependence checking of vectorizer, I believe the mentioned >>> revision just >>> exposed this. Previously the vectorization is skipped because of >>> unsupported memory >>> operation. The outer loop vectorization unrolls the outer loop into: >>> >>> for (b = 4; b > 0; b -= 4) >>> { >>> for (c = 0; c <= 6; c++) >>> a[c + 1][6] = a[c][5]; >>> for (c = 0; c <= 6; c++) >>> a[c + 1][5] = a[c][4]; >>> for (c = 0; c <= 6; c++) >>> a[c + 1][4] = a[c][3]; >>> for (c = 0; c <= 6; c++) >>> a[c + 1][3] = a[c][2]; >>> } >>> Then four inner loops are fused into: >>> for (b = 4; b > 0; b -= 4) >>> { >>> for (c = 0; c <= 6; c++) >>> { >>> a[c + 1][6] = a[c][5]; // S1 >>> a[c + 1][5] = a[c][4]; // S2 >>> a[c + 1][4] = a[c][3]; >>> a[c + 1][3] = a[c][2]; >>> } >>> } >> >> Note that they are not really "fused" but they are interleaved. With >> GIMPLE in mind >> that makes a difference, you should get the equivalent of >> >>for (c = 0; c <= 6; c++) >> { >>tem1 = a[c][5]; >>tem2 = a[c][4]; >>tem3 = a[c][3]; >>tem4 = a[c][2]; >>a[c+1][6] = tem1; >>a[c +1][5] = tem2; >> a[c+1][4] = tem3; >> a[c+1][3] = tem4; >> } > Yeah, I will double check if this abstract breaks the patch and how. Hmm, I think this doesn't break it, well at least for part of the analysis, because it is loop carried (backward) dependence goes wrong, interleaving or not with the same iteration doesn't matter here. >>> >>> I think the idea is that forward dependences are always fine (negative >>> distance) >>> to vectorize. But with backward dependences we have to adhere to max_vf. >>> >>> It looks like for outer loop vectorization we only look at the distances in >>> the >>> outer loop but never at inner ones. But here the same applies but isn't >>> that >>> independend on the distances with respect to the outer loop? >>> >>> But maybe I'm misunderstanding how "distances" work here. >> Hmm, I am not sure I understand "distance" correctly. With >> description as in book like "Optimizing compilers for Modern >> Architectures", distance is "# of iteration of sink ref - # of >> iteration of source ref". Given below example: >> for (i = 0; i < N; ++i) >> { >> x = arr[idx_1]; // S1 >> arr[idx_2] = x; // S2 >> } >> if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also >> this is forward dependence. For example, idx_1 is i + 1 and idx_2 is >> i; >> If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also >> this is backward dependence. For example idx_1 is i and idx_2 is i + >> 1; >> >> In GCC, we always try to subtract idx_2 from idx_1 first in computing >> classic distance, we could result in negative distance in case of >> backward dependence. When this happens at dependence carried level, >> we manually reverse it. When this happens at inner level loop, we >> simply keep the negative distance. And it's meaningless to talk about >> forward/backward given dependence is carried at outer level loop. >> >> Outer loop vectorization is interesting. The problematic case has >> backward dependence carried by outer loop. Because we don't check >> dist vs. max_vf for such dep, the unrolled references could have outer >> loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's >> like we have outer loop index resolved as equal. Now it has >> dependence only if c == c' + 1. I think previous comment on fusion >> still hold for this and we now need to check
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Dec 15, 2017 at 7:39 PM, Bin.Chengwrote: > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > wrote: >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng wrote: >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng wrote: On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener wrote: > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: >> Hi, >> As explained in the PR, given below test case: >> int a[8][10] = { [2][5] = 4 }, c; >> >> int >> main () >> { >> short b; >> int i, d; >> for (b = 4; b >= 0; b--) >> for (c = 0; c <= 6; c++) >> a[c + 1][b + 2] = a[c][b + 1]; >> for (i = 0; i < 8; i++) >> for (d = 0; d < 10; d++) >> if (a[i][d] != (i == 3 && d == 6) * 4) >> __builtin_abort (); >> return 0; >> >> the loop nest is illegal for vectorization without reversing inner loop. >> The issue >> is in data dependence checking of vectorizer, I believe the mentioned >> revision just >> exposed this. Previously the vectorization is skipped because of >> unsupported memory >> operation. The outer loop vectorization unrolls the outer loop into: >> >> for (b = 4; b > 0; b -= 4) >> { >> for (c = 0; c <= 6; c++) >> a[c + 1][6] = a[c][5]; >> for (c = 0; c <= 6; c++) >> a[c + 1][5] = a[c][4]; >> for (c = 0; c <= 6; c++) >> a[c + 1][4] = a[c][3]; >> for (c = 0; c <= 6; c++) >> a[c + 1][3] = a[c][2]; >> } >> Then four inner loops are fused into: >> for (b = 4; b > 0; b -= 4) >> { >> for (c = 0; c <= 6; c++) >> { >> a[c + 1][6] = a[c][5]; // S1 >> a[c + 1][5] = a[c][4]; // S2 >> a[c + 1][4] = a[c][3]; >> a[c + 1][3] = a[c][2]; >> } >> } > > Note that they are not really "fused" but they are interleaved. With > GIMPLE in mind > that makes a difference, you should get the equivalent of > >for (c = 0; c <= 6; c++) > { >tem1 = a[c][5]; >tem2 = a[c][4]; >tem3 = a[c][3]; >tem4 = a[c][2]; >a[c+1][6] = tem1; >a[c +1][5] = tem2; > a[c+1][4] = tem3; > a[c+1][3] = tem4; > } Yeah, I will double check if this abstract breaks the patch and how. >>> Hmm, I think this doesn't break it, well at least for part of the >>> analysis, because it is loop carried (backward) dependence goes wrong, >>> interleaving or not with the same iteration doesn't matter here. >> >> I think the idea is that forward dependences are always fine (negative >> distance) >> to vectorize. But with backward dependences we have to adhere to max_vf. >> >> It looks like for outer loop vectorization we only look at the distances in >> the >> outer loop but never at inner ones. But here the same applies but isn't that >> independend on the distances with respect to the outer loop? >> >> But maybe I'm misunderstanding how "distances" work here. > Hmm, I am not sure I understand "distance" correctly. With > description as in book like "Optimizing compilers for Modern > Architectures", distance is "# of iteration of sink ref - # of > iteration of source ref". Given below example: > for (i = 0; i < N; ++i) > { > x = arr[idx_1]; // S1 > arr[idx_2] = x; // S2 > } > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > i; > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also > this is backward dependence. For example idx_1 is i and idx_2 is i + > 1; > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > classic distance, we could result in negative distance in case of > backward dependence. When this happens at dependence carried level, > we manually reverse it. When this happens at inner level loop, we > simply keep the negative distance. And it's meaningless to talk about > forward/backward given dependence is carried at outer level loop. > > Outer loop vectorization is interesting. The problematic case has > backward dependence carried by outer loop. Because we don't check > dist vs. max_vf for such dep, the unrolled references could have outer > loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's > like we have outer loop index resolved as equal. Now it has > dependence only if c == c' + 1. I think previous comment on fusion > still hold for this and we now need to check backward dependence > between the two reference at inner level loop. I guess it's a bit > trick to model dependence between a[c][5] and a[c+1][5] using the > original references and dependence. I think it's
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Dec 15, 2017 at 1:19 PM, Richard Bienerwrote: > On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng wrote: >> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng wrote: >>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener >>> wrote: On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: > Hi, > As explained in the PR, given below test case: > int a[8][10] = { [2][5] = 4 }, c; > > int > main () > { > short b; > int i, d; > for (b = 4; b >= 0; b--) > for (c = 0; c <= 6; c++) > a[c + 1][b + 2] = a[c][b + 1]; > for (i = 0; i < 8; i++) > for (d = 0; d < 10; d++) > if (a[i][d] != (i == 3 && d == 6) * 4) > __builtin_abort (); > return 0; > > the loop nest is illegal for vectorization without reversing inner loop. > The issue > is in data dependence checking of vectorizer, I believe the mentioned > revision just > exposed this. Previously the vectorization is skipped because of > unsupported memory > operation. The outer loop vectorization unrolls the outer loop into: > > for (b = 4; b > 0; b -= 4) > { > for (c = 0; c <= 6; c++) > a[c + 1][6] = a[c][5]; > for (c = 0; c <= 6; c++) > a[c + 1][5] = a[c][4]; > for (c = 0; c <= 6; c++) > a[c + 1][4] = a[c][3]; > for (c = 0; c <= 6; c++) > a[c + 1][3] = a[c][2]; > } > Then four inner loops are fused into: > for (b = 4; b > 0; b -= 4) > { > for (c = 0; c <= 6; c++) > { > a[c + 1][6] = a[c][5]; // S1 > a[c + 1][5] = a[c][4]; // S2 > a[c + 1][4] = a[c][3]; > a[c + 1][3] = a[c][2]; > } > } Note that they are not really "fused" but they are interleaved. With GIMPLE in mind that makes a difference, you should get the equivalent of for (c = 0; c <= 6; c++) { tem1 = a[c][5]; tem2 = a[c][4]; tem3 = a[c][3]; tem4 = a[c][2]; a[c+1][6] = tem1; a[c +1][5] = tem2; a[c+1][4] = tem3; a[c+1][3] = tem4; } >>> Yeah, I will double check if this abstract breaks the patch and how. >> Hmm, I think this doesn't break it, well at least for part of the >> analysis, because it is loop carried (backward) dependence goes wrong, >> interleaving or not with the same iteration doesn't matter here. > > I think the idea is that forward dependences are always fine (negative > distance) > to vectorize. But with backward dependences we have to adhere to max_vf. > > It looks like for outer loop vectorization we only look at the distances in > the > outer loop but never at inner ones. But here the same applies but isn't that > independend on the distances with respect to the outer loop? > > But maybe I'm misunderstanding how "distances" work here. Hmm, I am not sure I understand "distance" correctly. With description as in book like "Optimizing compilers for Modern Architectures", distance is "# of iteration of sink ref - # of iteration of source ref". Given below example: for (i = 0; i < N; ++i) { x = arr[idx_1]; // S1 arr[idx_2] = x; // S2 } if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also this is forward dependence. For example, idx_1 is i + 1 and idx_2 is i; If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also this is backward dependence. For example idx_1 is i and idx_2 is i + 1; In GCC, we always try to subtract idx_2 from idx_1 first in computing classic distance, we could result in negative distance in case of backward dependence. When this happens at dependence carried level, we manually reverse it. When this happens at inner level loop, we simply keep the negative distance. And it's meaningless to talk about forward/backward given dependence is carried at outer level loop. Outer loop vectorization is interesting. The problematic case has backward dependence carried by outer loop. Because we don't check dist vs. max_vf for such dep, the unrolled references could have outer loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's like we have outer loop index resolved as equal. Now it has dependence only if c == c' + 1. I think previous comment on fusion still hold for this and we now need to check backward dependence between the two reference at inner level loop. I guess it's a bit trick to model dependence between a[c][5] and a[c+1][5] using the original references and dependence. I think it's okay to do that because distance of a[c][5] and a[c+1][5] is what we computed previously for the original references at inner level loop. Not sure if I have missed something important. Thanks, bin > > Richard. > >> Thanks, >>
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Dec 15, 2017 at 1:35 PM, Bin.Chengwrote: > On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng wrote: >> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener >> wrote: >>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: Hi, As explained in the PR, given below test case: int a[8][10] = { [2][5] = 4 }, c; int main () { short b; int i, d; for (b = 4; b >= 0; b--) for (c = 0; c <= 6; c++) a[c + 1][b + 2] = a[c][b + 1]; for (i = 0; i < 8; i++) for (d = 0; d < 10; d++) if (a[i][d] != (i == 3 && d == 6) * 4) __builtin_abort (); return 0; the loop nest is illegal for vectorization without reversing inner loop. The issue is in data dependence checking of vectorizer, I believe the mentioned revision just exposed this. Previously the vectorization is skipped because of unsupported memory operation. The outer loop vectorization unrolls the outer loop into: for (b = 4; b > 0; b -= 4) { for (c = 0; c <= 6; c++) a[c + 1][6] = a[c][5]; for (c = 0; c <= 6; c++) a[c + 1][5] = a[c][4]; for (c = 0; c <= 6; c++) a[c + 1][4] = a[c][3]; for (c = 0; c <= 6; c++) a[c + 1][3] = a[c][2]; } Then four inner loops are fused into: for (b = 4; b > 0; b -= 4) { for (c = 0; c <= 6; c++) { a[c + 1][6] = a[c][5]; // S1 a[c + 1][5] = a[c][4]; // S2 a[c + 1][4] = a[c][3]; a[c + 1][3] = a[c][2]; } } >>> >>> Note that they are not really "fused" but they are interleaved. With >>> GIMPLE in mind >>> that makes a difference, you should get the equivalent of >>> >>>for (c = 0; c <= 6; c++) >>> { >>>tem1 = a[c][5]; >>>tem2 = a[c][4]; >>>tem3 = a[c][3]; >>>tem4 = a[c][2]; >>>a[c+1][6] = tem1; >>>a[c +1][5] = tem2; >>> a[c+1][4] = tem3; >>> a[c+1][3] = tem4; >>> } >> Yeah, I will double check if this abstract breaks the patch and how. > Hmm, I think this doesn't break it, well at least for part of the > analysis, because it is loop carried (backward) dependence goes wrong, > interleaving or not with the same iteration doesn't matter here. I think the idea is that forward dependences are always fine (negative distance) to vectorize. But with backward dependences we have to adhere to max_vf. It looks like for outer loop vectorization we only look at the distances in the outer loop but never at inner ones. But here the same applies but isn't that independend on the distances with respect to the outer loop? But maybe I'm misunderstanding how "distances" work here. Richard. > Thanks, > bin >> >>> The loop fusion needs to meet the dependence requirement. Basically, GCC's data dependence analyzer does not model dep between references in sibling loops, but in practice, fusion requirement can be checked by analyzing all data references after fusion, and there is no backward data dependence. Apparently, the requirement is violated because we have backward data dependence between references (a[c][5], a[c+1][5]) in S1/S2. Note, if we reverse the inner loop, the outer loop would become legal for vectorization. This patch fixes the issue by enforcing dependence check. It also adds two tests with one shouldn't be vectorized and the other should. Bootstrap and test on x86_64 and AArch64. Is it OK? >>> >>> I think you have identified the spot where things go wrong but I'm not >>> sure you fix the >>> problem fully. The spot you pacth is (loop is the outer loop): >>> >>> loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr)); >>> ... >>> FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) >>> { >>> int dist = dist_v[loop_depth]; >>> ... >>> if (dist > 0 && DDR_REVERSED_P (ddr)) >>> { >>> /* If DDR_REVERSED_P the order of the data-refs in DDR was >>> reversed (to make distance vector positive), and the actual >>> distance is negative. */ >>> if (dump_enabled_p ()) >>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >>> "dependence distance negative.\n"); >>> >>> where you add >>> >>> + /* When doing outer loop vectorization, we need to check if there >>> is >>> +backward dependence at inner loop level if dependence at the >>> outer >>> +loop is reversed. See PR81740 for more information. */ >>> + if (nested_in_vect_loop_p (loop, DR_STMT (dra)) >>> + || nested_in_vect_loop_p (loop, DR_STMT (drb))) >>> + { >>> +
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Dec 15, 2017 at 12:09 PM, Bin.Chengwrote: > On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > wrote: >> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: >>> Hi, >>> As explained in the PR, given below test case: >>> int a[8][10] = { [2][5] = 4 }, c; >>> >>> int >>> main () >>> { >>> short b; >>> int i, d; >>> for (b = 4; b >= 0; b--) >>> for (c = 0; c <= 6; c++) >>> a[c + 1][b + 2] = a[c][b + 1]; >>> for (i = 0; i < 8; i++) >>> for (d = 0; d < 10; d++) >>> if (a[i][d] != (i == 3 && d == 6) * 4) >>> __builtin_abort (); >>> return 0; >>> >>> the loop nest is illegal for vectorization without reversing inner loop. >>> The issue >>> is in data dependence checking of vectorizer, I believe the mentioned >>> revision just >>> exposed this. Previously the vectorization is skipped because of >>> unsupported memory >>> operation. The outer loop vectorization unrolls the outer loop into: >>> >>> for (b = 4; b > 0; b -= 4) >>> { >>> for (c = 0; c <= 6; c++) >>> a[c + 1][6] = a[c][5]; >>> for (c = 0; c <= 6; c++) >>> a[c + 1][5] = a[c][4]; >>> for (c = 0; c <= 6; c++) >>> a[c + 1][4] = a[c][3]; >>> for (c = 0; c <= 6; c++) >>> a[c + 1][3] = a[c][2]; >>> } >>> Then four inner loops are fused into: >>> for (b = 4; b > 0; b -= 4) >>> { >>> for (c = 0; c <= 6; c++) >>> { >>> a[c + 1][6] = a[c][5]; // S1 >>> a[c + 1][5] = a[c][4]; // S2 >>> a[c + 1][4] = a[c][3]; >>> a[c + 1][3] = a[c][2]; >>> } >>> } >> >> Note that they are not really "fused" but they are interleaved. With >> GIMPLE in mind >> that makes a difference, you should get the equivalent of >> >>for (c = 0; c <= 6; c++) >> { >>tem1 = a[c][5]; >>tem2 = a[c][4]; >>tem3 = a[c][3]; >>tem4 = a[c][2]; >>a[c+1][6] = tem1; >>a[c +1][5] = tem2; >> a[c+1][4] = tem3; >> a[c+1][3] = tem4; >> } > Yeah, I will double check if this abstract breaks the patch and how. Hmm, I think this doesn't break it, well at least for part of the analysis, because it is loop carried (backward) dependence goes wrong, interleaving or not with the same iteration doesn't matter here. Thanks, bin > >> >>> The loop fusion needs to meet the dependence requirement. Basically, GCC's >>> data >>> dependence analyzer does not model dep between references in sibling loops, >>> but >>> in practice, fusion requirement can be checked by analyzing all data >>> references >>> after fusion, and there is no backward data dependence. >>> >>> Apparently, the requirement is violated because we have backward data >>> dependence >>> between references (a[c][5], a[c+1][5]) in S1/S2. Note, if we reverse the >>> inner >>> loop, the outer loop would become legal for vectorization. >>> >>> This patch fixes the issue by enforcing dependence check. It also adds two >>> tests >>> with one shouldn't be vectorized and the other should. Bootstrap and test >>> on x86_64 >>> and AArch64. Is it OK? >> >> I think you have identified the spot where things go wrong but I'm not >> sure you fix the >> problem fully. The spot you pacth is (loop is the outer loop): >> >> loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr)); >> ... >> FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) >> { >> int dist = dist_v[loop_depth]; >> ... >> if (dist > 0 && DDR_REVERSED_P (ddr)) >> { >> /* If DDR_REVERSED_P the order of the data-refs in DDR was >> reversed (to make distance vector positive), and the actual >> distance is negative. */ >> if (dump_enabled_p ()) >> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >> "dependence distance negative.\n"); >> >> where you add >> >> + /* When doing outer loop vectorization, we need to check if there >> is >> +backward dependence at inner loop level if dependence at the >> outer >> +loop is reversed. See PR81740 for more information. */ >> + if (nested_in_vect_loop_p (loop, DR_STMT (dra)) >> + || nested_in_vect_loop_p (loop, DR_STMT (drb))) >> + { >> + unsigned inner_depth = index_in_loop_nest (loop->inner->num, >> +DDR_LOOP_NEST >> (ddr)); >> + if (dist_v[inner_depth] < 0) >> + return true; >> + } >> >> but I don't understand how the dependence direction with respect to the >> outer loop matters here. > If the direction wrto outer loop is positive by itself, i.e, > reversed_p equals to false, then dist is checked against max_vf. In > this case, it's not possible to have references refer to the same > object? > On the other hand, dist is not checked at all for reversed case. > Maybe an additional
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Dec 15, 2017 at 11:55 AM, Richard Bienerwrote: > On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng wrote: >> Hi, >> As explained in the PR, given below test case: >> int a[8][10] = { [2][5] = 4 }, c; >> >> int >> main () >> { >> short b; >> int i, d; >> for (b = 4; b >= 0; b--) >> for (c = 0; c <= 6; c++) >> a[c + 1][b + 2] = a[c][b + 1]; >> for (i = 0; i < 8; i++) >> for (d = 0; d < 10; d++) >> if (a[i][d] != (i == 3 && d == 6) * 4) >> __builtin_abort (); >> return 0; >> >> the loop nest is illegal for vectorization without reversing inner loop. >> The issue >> is in data dependence checking of vectorizer, I believe the mentioned >> revision just >> exposed this. Previously the vectorization is skipped because of >> unsupported memory >> operation. The outer loop vectorization unrolls the outer loop into: >> >> for (b = 4; b > 0; b -= 4) >> { >> for (c = 0; c <= 6; c++) >> a[c + 1][6] = a[c][5]; >> for (c = 0; c <= 6; c++) >> a[c + 1][5] = a[c][4]; >> for (c = 0; c <= 6; c++) >> a[c + 1][4] = a[c][3]; >> for (c = 0; c <= 6; c++) >> a[c + 1][3] = a[c][2]; >> } >> Then four inner loops are fused into: >> for (b = 4; b > 0; b -= 4) >> { >> for (c = 0; c <= 6; c++) >> { >> a[c + 1][6] = a[c][5]; // S1 >> a[c + 1][5] = a[c][4]; // S2 >> a[c + 1][4] = a[c][3]; >> a[c + 1][3] = a[c][2]; >> } >> } > > Note that they are not really "fused" but they are interleaved. With > GIMPLE in mind > that makes a difference, you should get the equivalent of > >for (c = 0; c <= 6; c++) > { >tem1 = a[c][5]; >tem2 = a[c][4]; >tem3 = a[c][3]; >tem4 = a[c][2]; >a[c+1][6] = tem1; >a[c +1][5] = tem2; > a[c+1][4] = tem3; > a[c+1][3] = tem4; > } Yeah, I will double check if this abstract breaks the patch and how. > >> The loop fusion needs to meet the dependence requirement. Basically, GCC's >> data >> dependence analyzer does not model dep between references in sibling loops, >> but >> in practice, fusion requirement can be checked by analyzing all data >> references >> after fusion, and there is no backward data dependence. >> >> Apparently, the requirement is violated because we have backward data >> dependence >> between references (a[c][5], a[c+1][5]) in S1/S2. Note, if we reverse the >> inner >> loop, the outer loop would become legal for vectorization. >> >> This patch fixes the issue by enforcing dependence check. It also adds two >> tests >> with one shouldn't be vectorized and the other should. Bootstrap and test >> on x86_64 >> and AArch64. Is it OK? > > I think you have identified the spot where things go wrong but I'm not > sure you fix the > problem fully. The spot you pacth is (loop is the outer loop): > > loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr)); > ... > FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > { > int dist = dist_v[loop_depth]; > ... > if (dist > 0 && DDR_REVERSED_P (ddr)) > { > /* If DDR_REVERSED_P the order of the data-refs in DDR was > reversed (to make distance vector positive), and the actual > distance is negative. */ > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "dependence distance negative.\n"); > > where you add > > + /* When doing outer loop vectorization, we need to check if there is > +backward dependence at inner loop level if dependence at the > outer > +loop is reversed. See PR81740 for more information. */ > + if (nested_in_vect_loop_p (loop, DR_STMT (dra)) > + || nested_in_vect_loop_p (loop, DR_STMT (drb))) > + { > + unsigned inner_depth = index_in_loop_nest (loop->inner->num, > +DDR_LOOP_NEST (ddr)); > + if (dist_v[inner_depth] < 0) > + return true; > + } > > but I don't understand how the dependence direction with respect to the > outer loop matters here. If the direction wrto outer loop is positive by itself, i.e, reversed_p equals to false, then dist is checked against max_vf. In this case, it's not possible to have references refer to the same object? On the other hand, dist is not checked at all for reversed case. Maybe an additional check "dist < max_vf" can relax the patch a bit. > > Given there's DDR_REVERSED on the outer loop distance what does that > mean for the inner loop distance given the quite non-obvious code handling > this case in tree-data-ref.c: > > /* Verify a basic constraint: classic distance vectors should > always be lexicographically positive. > > Data references are collected in the order of execution of > the
Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization
On Fri, Dec 15, 2017 at 12:30 PM, Bin Chengwrote: > Hi, > As explained in the PR, given below test case: > int a[8][10] = { [2][5] = 4 }, c; > > int > main () > { > short b; > int i, d; > for (b = 4; b >= 0; b--) > for (c = 0; c <= 6; c++) > a[c + 1][b + 2] = a[c][b + 1]; > for (i = 0; i < 8; i++) > for (d = 0; d < 10; d++) > if (a[i][d] != (i == 3 && d == 6) * 4) > __builtin_abort (); > return 0; > > the loop nest is illegal for vectorization without reversing inner loop. The > issue > is in data dependence checking of vectorizer, I believe the mentioned > revision just > exposed this. Previously the vectorization is skipped because of unsupported > memory > operation. The outer loop vectorization unrolls the outer loop into: > > for (b = 4; b > 0; b -= 4) > { > for (c = 0; c <= 6; c++) > a[c + 1][6] = a[c][5]; > for (c = 0; c <= 6; c++) > a[c + 1][5] = a[c][4]; > for (c = 0; c <= 6; c++) > a[c + 1][4] = a[c][3]; > for (c = 0; c <= 6; c++) > a[c + 1][3] = a[c][2]; > } > Then four inner loops are fused into: > for (b = 4; b > 0; b -= 4) > { > for (c = 0; c <= 6; c++) > { > a[c + 1][6] = a[c][5]; // S1 > a[c + 1][5] = a[c][4]; // S2 > a[c + 1][4] = a[c][3]; > a[c + 1][3] = a[c][2]; > } > } Note that they are not really "fused" but they are interleaved. With GIMPLE in mind that makes a difference, you should get the equivalent of for (c = 0; c <= 6; c++) { tem1 = a[c][5]; tem2 = a[c][4]; tem3 = a[c][3]; tem4 = a[c][2]; a[c+1][6] = tem1; a[c +1][5] = tem2; a[c+1][4] = tem3; a[c+1][3] = tem4; } > The loop fusion needs to meet the dependence requirement. Basically, GCC's > data > dependence analyzer does not model dep between references in sibling loops, > but > in practice, fusion requirement can be checked by analyzing all data > references > after fusion, and there is no backward data dependence. > > Apparently, the requirement is violated because we have backward data > dependence > between references (a[c][5], a[c+1][5]) in S1/S2. Note, if we reverse the > inner > loop, the outer loop would become legal for vectorization. > > This patch fixes the issue by enforcing dependence check. It also adds two > tests > with one shouldn't be vectorized and the other should. Bootstrap and test on > x86_64 > and AArch64. Is it OK? I think you have identified the spot where things go wrong but I'm not sure you fix the problem fully. The spot you pacth is (loop is the outer loop): loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr)); ... FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) { int dist = dist_v[loop_depth]; ... if (dist > 0 && DDR_REVERSED_P (ddr)) { /* If DDR_REVERSED_P the order of the data-refs in DDR was reversed (to make distance vector positive), and the actual distance is negative. */ if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "dependence distance negative.\n"); where you add + /* When doing outer loop vectorization, we need to check if there is +backward dependence at inner loop level if dependence at the outer +loop is reversed. See PR81740 for more information. */ + if (nested_in_vect_loop_p (loop, DR_STMT (dra)) + || nested_in_vect_loop_p (loop, DR_STMT (drb))) + { + unsigned inner_depth = index_in_loop_nest (loop->inner->num, +DDR_LOOP_NEST (ddr)); + if (dist_v[inner_depth] < 0) + return true; + } but I don't understand how the dependence direction with respect to the outer loop matters here. Given there's DDR_REVERSED on the outer loop distance what does that mean for the inner loop distance given the quite non-obvious code handling this case in tree-data-ref.c: /* Verify a basic constraint: classic distance vectors should always be lexicographically positive. Data references are collected in the order of execution of the program, thus for the following loop | for (i = 1; i < 100; i++) | for (j = 1; j < 100; j++) | { | t = T[j+1][i-1]; // A | T[j][i] = t + 2; // B | } references are collected following the direction of the wind: A then B. The data dependence tests are performed also following this order, such that we're looking at the distance separating the elements accessed by A from the elements later accessed by B. But in this example, the distance returned by test_dep (A, B) is lexicographically negative (-1, 1), that means that the access A
[PATCH PR81740]Enforce dependence check for outer loop vectorization
Hi, As explained in the PR, given below test case: int a[8][10] = { [2][5] = 4 }, c; int main () { short b; int i, d; for (b = 4; b >= 0; b--) for (c = 0; c <= 6; c++) a[c + 1][b + 2] = a[c][b + 1]; for (i = 0; i < 8; i++) for (d = 0; d < 10; d++) if (a[i][d] != (i == 3 && d == 6) * 4) __builtin_abort (); return 0; the loop nest is illegal for vectorization without reversing inner loop. The issue is in data dependence checking of vectorizer, I believe the mentioned revision just exposed this. Previously the vectorization is skipped because of unsupported memory operation. The outer loop vectorization unrolls the outer loop into: for (b = 4; b > 0; b -= 4) { for (c = 0; c <= 6; c++) a[c + 1][6] = a[c][5]; for (c = 0; c <= 6; c++) a[c + 1][5] = a[c][4]; for (c = 0; c <= 6; c++) a[c + 1][4] = a[c][3]; for (c = 0; c <= 6; c++) a[c + 1][3] = a[c][2]; } Then four inner loops are fused into: for (b = 4; b > 0; b -= 4) { for (c = 0; c <= 6; c++) { a[c + 1][6] = a[c][5]; // S1 a[c + 1][5] = a[c][4]; // S2 a[c + 1][4] = a[c][3]; a[c + 1][3] = a[c][2]; } } The loop fusion needs to meet the dependence requirement. Basically, GCC's data dependence analyzer does not model dep between references in sibling loops, but in practice, fusion requirement can be checked by analyzing all data references after fusion, and there is no backward data dependence. Apparently, the requirement is violated because we have backward data dependence between references (a[c][5], a[c+1][5]) in S1/S2. Note, if we reverse the inner loop, the outer loop would become legal for vectorization. This patch fixes the issue by enforcing dependence check. It also adds two tests with one shouldn't be vectorized and the other should. Bootstrap and test on x86_64 and AArch64. Is it OK? Thanks, bin 2017-12-15 Bin ChengPR tree-optimization/81740 * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In case of outer loop vectorization, check backward dependence at inner loop if dependence at outer loop is reversed. gcc/testsuite 2017-12-15 Bin Cheng PR tree-optimization/81740 * gcc.dg/vect/pr81740-1.c: New test. * gcc.dg/vect/pr81740-2.c: Refine test.From c0c8cfae08c0bde2cec41a8d3abcbfea0bd2e211 Mon Sep 17 00:00:00 2001 From: Bin Cheng Date: Thu, 14 Dec 2017 15:32:02 + Subject: [PATCH] pr81740-20171212.txt --- gcc/testsuite/gcc.dg/vect/pr81740-1.c | 17 + gcc/testsuite/gcc.dg/vect/pr81740-2.c | 21 + gcc/tree-vect-data-refs.c | 11 +++ 3 files changed, 49 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/vect/pr81740-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/pr81740-2.c diff --git a/gcc/testsuite/gcc.dg/vect/pr81740-1.c b/gcc/testsuite/gcc.dg/vect/pr81740-1.c new file mode 100644 index 000..d90aba5 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr81740-1.c @@ -0,0 +1,17 @@ +/* { dg-do run } */ +int a[8][10] = { [2][5] = 4 }, c; + +int +main () +{ + short b; + int i, d; + for (b = 4; b >= 0; b--) +for (c = 0; c <= 6; c++) + a[c + 1][b + 2] = a[c][b + 1]; + for (i = 0; i < 8; i++) +for (d = 0; d < 10; d++) + if (a[i][d] != (i == 3 && d == 6) * 4) +__builtin_abort (); + return 0; +} diff --git a/gcc/testsuite/gcc.dg/vect/pr81740-2.c b/gcc/testsuite/gcc.dg/vect/pr81740-2.c new file mode 100644 index 000..fb5b300 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr81740-2.c @@ -0,0 +1,21 @@ +/* { dg-do run } */ +/* { dg-require-effective-target vect_int } */ + +int a[8][10] = { [2][5] = 4 }, c; + +int +main () +{ + short b; + int i, d; + for (b = 4; b >= 0; b--) +for (c = 6; c >= 0; c--) + a[c + 1][b + 2] = a[c][b + 1]; + for (i = 0; i < 8; i++) +for (d = 0; d < 10; d++) + if (a[i][d] != (i == 3 && d == 6) * 4) +__builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */ diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index 996d156..3b780cf1 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -435,6 +435,17 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr, if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "dependence distance negative.\n"); + /* When doing outer loop vectorization, we need to check if there is +backward dependence at inner loop level if dependence at the outer +loop is reversed. See PR81740 for more information. */ + if (nested_in_vect_loop_p (loop, DR_STMT (dra)) + || nested_in_vect_loop_p (loop, DR_STMT (drb))) + { + unsigned inner_depth