On Mon, Dec 18, 2017 at 1:37 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener >> <richard.guent...@gmail.com> wrote: >>> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener >>>>> <richard.guent...@gmail.com> wrote: >>>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>>>>>> Hi, >>>>>>> As explained in the PR, given below test case: >>>>>>> int a[8][10] = { [2][5] = 4 }, c; >>>>>>> >>>>>>> int >>>>>>> main () >>>>>>> { >>>>>>> short b; >>>>>>> int i, d; >>>>>>> for (b = 4; b >= 0; b--) >>>>>>> for (c = 0; c <= 6; c++) >>>>>>> a[c + 1][b + 2] = a[c][b + 1]; >>>>>>> for (i = 0; i < 8; i++) >>>>>>> for (d = 0; d < 10; d++) >>>>>>> if (a[i][d] != (i == 3 && d == 6) * 4) >>>>>>> __builtin_abort (); >>>>>>> return 0; >>>>>>> >>>>>>> the loop nest is illegal for vectorization without reversing inner >>>>>>> loop. The issue >>>>>>> is in data dependence checking of vectorizer, I believe the mentioned >>>>>>> revision just >>>>>>> exposed this. Previously the vectorization is skipped because of >>>>>>> unsupported memory >>>>>>> operation. The outer loop vectorization unrolls the outer loop into: >>>>>>> >>>>>>> for (b = 4; b > 0; b -= 4) >>>>>>> { >>>>>>> for (c = 0; c <= 6; c++) >>>>>>> a[c + 1][6] = a[c][5]; >>>>>>> for (c = 0; c <= 6; c++) >>>>>>> a[c + 1][5] = a[c][4]; >>>>>>> for (c = 0; c <= 6; c++) >>>>>>> a[c + 1][4] = a[c][3]; >>>>>>> for (c = 0; c <= 6; c++) >>>>>>> a[c + 1][3] = a[c][2]; >>>>>>> } >>>>>>> Then four inner loops are fused into: >>>>>>> for (b = 4; b > 0; b -= 4) >>>>>>> { >>>>>>> for (c = 0; c <= 6; c++) >>>>>>> { >>>>>>> a[c + 1][6] = a[c][5]; // S1 >>>>>>> a[c + 1][5] = a[c][4]; // S2 >>>>>>> a[c + 1][4] = a[c][3]; >>>>>>> a[c + 1][3] = a[c][2]; >>>>>>> } >>>>>>> } >>>>>> >>>>>> Note that they are not really "fused" but they are interleaved. With >>>>>> GIMPLE in mind >>>>>> that makes a difference, you should get the equivalent of >>>>>> >>>>>> for (c = 0; c <= 6; c++) >>>>>> { >>>>>> tem1 = a[c][5]; >>>>>> tem2 = a[c][4]; >>>>>> tem3 = a[c][3]; >>>>>> tem4 = a[c][2]; >>>>>> a[c+1][6] = tem1; >>>>>> a[c +1][5] = tem2; >>>>>> a[c+1][4] = tem3; >>>>>> a[c+1][3] = tem4; >>>>>> } >>>>> Yeah, I will double check if this abstract breaks the patch and how. >>>> Hmm, I think this doesn't break it, well at least for part of the >>>> analysis, because it is loop carried (backward) dependence goes wrong, >>>> interleaving or not with the same iteration doesn't matter here. >>> >>> I think the idea is that forward dependences are always fine (negative >>> distance) >>> to vectorize. But with backward dependences we have to adhere to max_vf. >>> >>> It looks like for outer loop vectorization we only look at the distances in >>> the >>> outer loop but never at inner ones. But here the same applies but isn't >>> that >>> independend on the distances with respect to the outer loop? >>> >>> But maybe I'm misunderstanding how "distances" work here. >> Hmm, I am not sure I understand "distance" correctly. With >> description as in book like "Optimizing compilers for Modern >> Architectures", distance is "# of iteration of sink ref - # of >> iteration of source ref". Given below example: >> for (i = 0; i < N; ++i) >> { >> x = arr[idx_1]; // S1 >> arr[idx_2] = x; // S2 >> } >> if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also >> this is forward dependence. For example, idx_1 is i + 1 and idx_2 is >> i; >> If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also >> this is backward dependence. For example idx_1 is i and idx_2 is i + >> 1; >> >> In GCC, we always try to subtract idx_2 from idx_1 first in computing >> classic distance, we could result in negative distance in case of >> backward dependence. When this happens at dependence carried level, >> we manually reverse it. When this happens at inner level loop, we >> simply keep the negative distance. And it's meaningless to talk about >> forward/backward given dependence is carried at outer level loop. >> >> Outer loop vectorization is interesting. The problematic case has >> backward dependence carried by outer loop. Because we don't check >> dist vs. max_vf for such dep, the unrolled references could have outer >> loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's >> like we have outer loop index resolved as equal. Now it has >> dependence only if c == c' + 1. I think previous comment on fusion >> still hold for this and we now need to check backward dependence >> between the two reference at inner level loop. I guess it's a bit >> trick to model dependence between a[c][5] and a[c+1][5] using the >> original references and dependence. I think it's okay to do that >> because distance of a[c][5] and a[c+1][5] is what we computed >> previously for the original references at inner level loop. >> >> Not sure if I have missed something important. > > Not sure either. unroll-and-jam does > > FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > { > /* A distance (a,b) is at worst transformed into (a/N,b) by the > unrolling (factor N), so the transformation is valid if > a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll > factor needs to be limited so that the first condition holds. > That may limit the factor down to zero in the worst case. */ > int dist = dist_v[0]; > if (dist < 0) > gcc_unreachable (); > else if ((unsigned)dist >= *unroll) > ; > else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > && dist > 0)) > ; > else > *unroll = dist; > > where *unroll is similar to *max_vf I think. dist_v[0] is the innermost loop. > > The vectorizer does way more complicated things and only looks at the > distance with respect to the outer loop as far as I can see which can > be negative. > > Not sure if fusion and vectorizer "interleaving" makes a difference here. > I think the idea was that when interleaving stmt-by-stmt then forward > dependences would be preserved and thus we don't need to check the > inner loop dependences. speaking with "forward vs. backward" dependences > again, not distances... > > This also means that unroll-and-jam could be enhanced to "interleave" > stmts and thus cover more cases? > > Again, I hope Micha can have a look here...
And _maybe_ PR60276 and its fix (adding STMT_VINFO_MIN_NEG_DIST) is related. Richard. > Richard. > >> Thanks, >> bin >>> >>> Richard. >>> >>>> Thanks, >>>> bin >>>>> >>>>>> >>>>>>> The loop fusion needs to meet the dependence requirement. Basically, >>>>>>> GCC's data >>>>>>> dependence analyzer does not model dep between references in sibling >>>>>>> loops, but >>>>>>> in practice, fusion requirement can be checked by analyzing all data >>>>>>> references >>>>>>> after fusion, and there is no backward data dependence. >>>>>>> >>>>>>> Apparently, the requirement is violated because we have backward data >>>>>>> dependence >>>>>>> between references (a[c][5], a[c+1][5]) in S1/S2. Note, if we reverse >>>>>>> the inner >>>>>>> loop, the outer loop would become legal for vectorization. >>>>>>> >>>>>>> This patch fixes the issue by enforcing dependence check. It also adds >>>>>>> two tests >>>>>>> with one shouldn't be vectorized and the other should. Bootstrap and >>>>>>> test on x86_64 >>>>>>> and AArch64. Is it OK? >>>>>> >>>>>> I think you have identified the spot where things go wrong but I'm not >>>>>> sure you fix the >>>>>> problem fully. The spot you pacth is (loop is the outer loop): >>>>>> >>>>>> loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr)); >>>>>> ... >>>>>> FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) >>>>>> { >>>>>> int dist = dist_v[loop_depth]; >>>>>> ... >>>>>> if (dist > 0 && DDR_REVERSED_P (ddr)) >>>>>> { >>>>>> /* If DDR_REVERSED_P the order of the data-refs in DDR was >>>>>> reversed (to make distance vector positive), and the actual >>>>>> distance is negative. */ >>>>>> if (dump_enabled_p ()) >>>>>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >>>>>> "dependence distance negative.\n"); >>>>>> >>>>>> where you add >>>>>> >>>>>> + /* When doing outer loop vectorization, we need to check if >>>>>> there is >>>>>> + backward dependence at inner loop level if dependence at >>>>>> the outer >>>>>> + loop is reversed. See PR81740 for more information. */ >>>>>> + if (nested_in_vect_loop_p (loop, DR_STMT (dra)) >>>>>> + || nested_in_vect_loop_p (loop, DR_STMT (drb))) >>>>>> + { >>>>>> + unsigned inner_depth = index_in_loop_nest >>>>>> (loop->inner->num, >>>>>> + DDR_LOOP_NEST >>>>>> (ddr)); >>>>>> + if (dist_v[inner_depth] < 0) >>>>>> + return true; >>>>>> + } >>>>>> >>>>>> but I don't understand how the dependence direction with respect to the >>>>>> outer loop matters here. >>>>> If the direction wrto outer loop is positive by itself, i.e, >>>>> reversed_p equals to false, then dist is checked against max_vf. In >>>>> this case, it's not possible to have references refer to the same >>>>> object? >>>>> On the other hand, dist is not checked at all for reversed case. >>>>> Maybe an additional check "dist < max_vf" can relax the patch a bit. >>>>>> >>>>>> Given there's DDR_REVERSED on the outer loop distance what does that >>>>>> mean for the inner loop distance given the quite non-obvious code >>>>>> handling >>>>>> this case in tree-data-ref.c: >>>>>> >>>>>> /* Verify a basic constraint: classic distance vectors should >>>>>> always be lexicographically positive. >>>>>> >>>>>> Data references are collected in the order of execution of >>>>>> the program, thus for the following loop >>>>>> >>>>>> | for (i = 1; i < 100; i++) >>>>>> | for (j = 1; j < 100; j++) >>>>>> | { >>>>>> | t = T[j+1][i-1]; // A >>>>>> | T[j][i] = t + 2; // B >>>>>> | } >>>>>> >>>>>> references are collected following the direction of the wind: >>>>>> A then B. The data dependence tests are performed also >>>>>> following this order, such that we're looking at the distance >>>>>> separating the elements accessed by A from the elements later >>>>>> accessed by B. But in this example, the distance returned by >>>>>> test_dep (A, B) is lexicographically negative (-1, 1), that >>>>>> means that the access A occurs later than B with respect to >>>>>> the outer loop, ie. we're actually looking upwind. In this >>>>>> case we solve test_dep (B, A) looking downwind to the >>>>>> lexicographically positive solution, that returns the >>>>>> distance vector (1, -1). */ >>>>>> if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr))) >>>>>> { >>>>>> lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr)); >>>>>> if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest)) >>>>>> return false; >>>>>> compute_subscript_distance (ddr); >>>>>> if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b, >>>>>> &index_carry)) >>>>>> return false; >>>>>> save_dist_v (ddr, save_v); >>>>>> DDR_REVERSED_P (ddr) = true; >>>>>> >>>>>> that is, what does dist_v[inner_depth] < 0 tell us here? Does it tell us >>>>>> that the dependence direction is reverse of the outer loop? >>>>> Hmm, this part of code is a bit confusing for me. So we always >>>>> compute classic distance by sorting two data references in lexico >>>>> order, which could be the opposite given we collect references by dom >>>>> order. IIUC, the negative dist at inner loop means a backward >>>>> dependence for that index/loop. It's not related to outer loop or the >>>>> reversed_p flag. >>>>> >>>>> Thanks, >>>>> bin >>>>>> >>>>>> CCing Micha who might remember some more details from unroll-and-jam. >>>>>> >>>>>> Thanks, >>>>>> Richard. >>>>>> >>>>>>> Thanks, >>>>>>> bin >>>>>>> 2017-12-15 Bin Cheng <bin.ch...@arm.com> >>>>>>> >>>>>>> PR tree-optimization/81740 >>>>>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In >>>>>>> case >>>>>>> of outer loop vectorization, check backward dependence at inner >>>>>>> loop >>>>>>> if dependence at outer loop is reversed. >>>>>>> >>>>>>> gcc/testsuite >>>>>>> 2017-12-15 Bin Cheng <bin.ch...@arm.com> >>>>>>> >>>>>>> PR tree-optimization/81740 >>>>>>> * gcc.dg/vect/pr81740-1.c: New test. >>>>>>> * gcc.dg/vect/pr81740-2.c: Refine test.