On Wed, 16 Jun 2021, Andre Vieira (lists) wrote:

> On 14/06/2021 11:57, Richard Biener wrote:
> > On Mon, 14 Jun 2021, Richard Biener wrote:
> >
> >> Indeed. For example a simple
> >>
> >>   int a[1024], b[1024], c[1024];
> >>
> >>   void foo(int n)
> >>   {
> >>     for (int i = 0; i < n; ++i)
> >>       a[i+1] += c[i+i] ? b[i+1] : 0;
> >>   }
> >>
> >> should usually see peeling for alignment (though on x86 you need an
> >> exotic -march= since cost models generally have equal aligned and
> >> unaligned access costs). For example with -mavx2 -mtune=atom
> >> we'll see an alignment peeling prologue, an AVX2 vector loop,
> >> an SSE2 vectorized epilogue and a scalar epilogue. It also
> >> shows the original scalar loop being used in the scalar prologue
> >> and epilogue.
> >>
> >> We're not even trying to make the counting IV easily usable
> >> across loops (we're not counting scalar iterations in the
> >> vector loops).
> >
> > Specifically we see
> >
> >   <bb 33> [local count: 94607391]:
> >   niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> >   _67 = niters_vector_mult_vf.10_62 + 7;
> >   _64 = (int) niters_vector_mult_vf.10_62;
> >   tmp.11_63 = i_43 + _64;
> >   if (niters.8_45 == niters_vector_mult_vf.10_62)
> >     goto <bb 37>; [12.50%]
> >   else
> >     goto <bb 36>; [87.50%]
> >
> > after the main vect loop, recomputing the original IV (i) rather
> > than using the inserted canonical IV. And then the vectorized
> > epilogue header check doing
> >
> >   <bb 36> [local count: 93293400]:
> >   # i_59 = PHI <tmp.11_63(33), 0(18)>
> >   # _66 = PHI <_67(33), 0(18)>
> >   _96 = (unsigned int) n_10(D);
> >   niters.26_95 = _96 - _66;
> >   _108 = (unsigned int) n_10(D);
> >   _109 = _108 - _66;
> >   _110 = _109 + 4294967295;
> >   if (_110 <= 3)
> >     goto <bb 47>; [10.00%]
> >   else
> >     goto <bb 40>; [90.00%]
> >
> > recomputing everything from scratch again (also notice how
> > the main vect loop guard jumps around the alignment prologue
> > as well and lands here - and the vectorized epilogue uses
> > unaligned accesses - good!).
> >
> > That is, I'd expect a _much_ easier job if we managed to
> > track the number of performed scalar iterations (or the
> > number of scalar iterations remaining) using the canonical
> > IV we add to all loops, across all of the involved loops.
> >
> > Richard.
>
> So I am now looking at using an IV that counts scalar iterations rather
> than vector iterations and reusing it through all loops (prologue, main
> loop, vect_epilogue and scalar epilogue). The first part is easy, since
> that's what we already do for partial vectors or non-constant VFs. The
> latter requires some plumbing and removing a lot of the code in there
> that creates new IVs going from [0, niters - previous iterations]. I
> don't yet have a clear-cut view of how to do this. I first thought of
> keeping track of the 'control' IV in the loop_vinfo, but the prologue
> and scalar epilogues won't have one. 'loop' keeps a control_ivs struct,
> but that is used for overflow detection and only keeps track of what
> looks like a constant 'base' and 'step'. I'm not quite sure how all
> that works, but intuitively it doesn't seem like the right thing to
> reuse.
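To make the target shape concrete, here is a hand-written C sketch of
the example above with a single performed-scalar-iterations counter
reused by every loop. This is illustration only, not compiler output:
the 32-byte alignment test stands in for the real peeling condition,
and the vector bodies are emulated by inner scalar loops.

  #include <stdint.h>

  extern int a[1024], b[1024], c[1024];

  void foo_sketch (int n)
  {
    int i = 0;  /* scalar iterations performed so far */

    /* Alignment peeling prologue.  */
    while (i < n && ((uintptr_t) &a[i + 1] & 31) != 0)
      {
        a[i + 1] += c[i + i] ? b[i + 1] : 0;
        i++;
      }

    /* Main AVX2 loop, VF = 8: the guard is a plain compare of the
       shared counter against n, nothing is recomputed.  */
    for (; n - i >= 8; i += 8)
      for (int j = i; j < i + 8; j++)
        a[j + 1] += c[j + j] ? b[j + 1] : 0;

    /* SSE2 vectorized epilogue, VF = 4.  */
    for (; n - i >= 4; i += 4)
      for (int j = i; j < i + 4; j++)
        a[j + 1] += c[j + j] ? b[j + 1] : 0;

    /* Scalar epilogue.  */
    for (; i < n; i++)
      a[i + 1] += c[i + i] ? b[i + 1] : 0;
  }

Every guard between the loops is then just a compare of i against n,
which is exactly what the dumps above fail to achieve.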
Maybe it's enough to maintain this [remaining] scalar iterations counter
between loops, thus after the vector loop do

  remain_scalar_iter -= vector_iters * vf;

etc. This should make it possible to do some first-order cleanups,
avoiding some of the repeated computations. It does involve placing
additional PHIs for this remain_scalar_iter var of course (I'd be
hesitant to rely on the SSA renamer for this due to its expense). I
think that for all of the later jump-around tests, tracking remaining
scalar iters is more convenient than tracking performed scalar iters.

> I'll go hack around and keep you posted on progress.

Thanks - it's an iffy area ...

Richard.
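P.S.: For completeness, the down-counting variant of the same sketch.
Again hand-written; remain and its explicit updates are purely
illustrative, not existing vectorizer state.

  #include <stdint.h>

  extern int a[1024], b[1024], c[1024];

  void foo_down (int n)
  {
    int remain = n > 0 ? n : 0;  /* scalar iterations still to do */
    int i = 0;                   /* invariant: i == n - remain for n > 0 */

    /* Alignment peeling prologue.  */
    while (remain > 0 && ((uintptr_t) &a[i + 1] & 31) != 0)
      {
        a[i + 1] += c[i + i] ? b[i + 1] : 0;
        i++, remain--;
      }

    /* Main AVX2 loop, VF = 8.  Leaving it amounts to the
       remain_scalar_iter -= vector_iters * vf update above.  */
    for (; remain >= 8; i += 8, remain -= 8)
      for (int j = i; j < i + 8; j++)
        a[j + 1] += c[j + j] ? b[j + 1] : 0;

    /* SSE2 vectorized epilogue, VF = 4: the jump-around test is a
       plain compare of remain, with no recomputation from n.  */
    for (; remain >= 4; i += 4, remain -= 4)
      for (int j = i; j < i + 4; j++)
        a[j + 1] += c[j + j] ? b[j + 1] : 0;

    /* Scalar epilogue.  */
    for (; remain > 0; i++, remain--)
      a[i + 1] += c[i + i] ? b[i + 1] : 0;
  }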