> Tamar Christina <tamar.christ...@arm.com> writes: > > Hi All, > > > > The resulting predicate register of a whilelo is not > > restricted to the lower half of the predicate register file. > > > > As such these tests started failing after recent changes > > because the whilelo outside the loop is getting assigned p15. > > It's the whilelo in the loop for me. We go from: > > .L3: > ld1b z31.b, p7/z, [x4, x3] > movprfx z30, z31 > mul z30.b, p5/m, z30.b, z29.b > st1b z30.b, p7, [x4, x3] > mov p6.b, p7.b > add x3, x3, x0 > whilelo p7.b, w3, w1 > b.any .L3 > > to: > > .L3: > ld1b z31.b, p7/z, [x3, x2] > movprfx z29, z31 > mul z29.b, p6/m, z29.b, z30.b > st1b z29.b, p7, [x3, x2] > add x2, x2, x0 > whilelo p15.b, w2, w1 > b.any .L4 > [...] > .p2align 2,,3 > .L4: > mov p7.b, p15.b > b .L3 > > This adds an extra (admittedly unconditional) branch to every non-final > vector iteration, which seems unfortunate. I don't think we'd see > p8-p15 otherwise, since the result of the whilelo is used as a > governing predicate by the next iteration of the loop. > > This happens because the scalar loop is given an 89% chance of iterating. > Previously we gave the vector loop an 83.33% chance of iterating, whereas > after 061f74c06735e1fa35b910ae we give it a 12% chance. 0.89^16 == 15.50%, > so the new probabilities definitely preserve the original probabilities > more closely. But for purely heuristic probabilities like these, I'm > not sure we should lean so heavily into the idea that the vector > latch is unlikely. > > Honza, Richi, any thoughts? Just wanted to double-check that this > was operating as expected before making the tests accept the (arguably) > less efficient code. It looks like the commit was more aimed at fixing > the profile counts for the epilogues, rather than the main loop.
You are right that we shold not scale down static profiles in case they are artifically flat. It is nice to have actual testcase. Old code used to test: /* Without profile feedback, loops for which we do not know a better estimate are assumed to roll 10 times. When we unroll such loop, it appears to roll too little, and it may even seem to be cold. To avoid this, we ensure that the created loop appears to roll at least 5 times (but at most as many times as before unrolling). Don't do adjustment if profile feedback is present. */ if (new_est_niter < 5 && !profile_p) { if (est_niter < 5) new_est_niter = est_niter; else new_est_niter = 5; } This is not right when profile feedback is around and also when we managed to determine precise #of itrations at branch prediction time and did not cap. So I replaced it iwht the test that adjusted header count is not smaller than the preheader edge count. However this will happily get loop iteration count close to 0. It is bit hard to figure out if profile is realistic: Sometimes we do profile_status_for_fn (cfun) != PROFILE_READ I am trying to get rid of this test. With LTO or when comdat profile is lost we inline together functions with and without profile. We can test for quality of loop header count to be precise or adjusted. However at the time vectorizer is modifying loop profile we already adjusted it for the initial conditional for profitability threshold and drop it to GUESSED.Even with profile feedback we do not know outcome probability of that one (Ondrej Kubanek's histograms will help here). So I think we want to check if we have loop iteration estimate recorded (that should be true for both profile feedback and loops with known trip count) and if so compare it what profile says and it is more or less in match consider profile realistic. This needs to be done before vectorizer starts tampering with the loop. I will try to make patch for that. Honza > > Thanks, > Richard > > > This widens the regexp. > > > > Tested on aarch64-none-linux-gnu and passes again. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/sve/live_1.c: Update assembly. > > > > --- inline copy of patch -- > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/live_1.c > > b/gcc/testsuite/gcc.target/aarch64/sve/live_1.c > > index > > 80ee176d1807bf628ad47551d69ff5d84deda79e..2db6c3c209a9514646e92628f3d2dd58d466539c > > 100644 > > --- a/gcc/testsuite/gcc.target/aarch64/sve/live_1.c > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/live_1.c > > @@ -27,10 +27,10 @@ > > > > TEST_ALL (EXTRACT_LAST) > > > > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].b, } 2 } } */ > > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].h, } 4 } } */ > > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].s, } 4 } } */ > > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7].d, } 4 } } */ > > +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+.b, } 2 } } */ > > +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+.h, } 4 } } */ > > +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+.s, } 4 } } */ > > +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+.d, } 4 } } */ > > > > /* { dg-final { scan-assembler-times {\tlastb\tb[0-9]+, p[0-7], > > z[0-9]+\.b\n} 1 } } */ > > /* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], > > z[0-9]+\.h\n} 2 } } */