On Wed, Oct 15, 2025 at 12:08 AM Kugan Vivekanandarajah <[email protected]> wrote: > > Hi, > > This patch eliminates redundant reverse permutations in vectorized reverse > loops by detecting and optimizing patterns during store vectorization. > > The reverse load (b[i]) generates PERM, operations are applied, then the > reverse store adds another PERM. This creates redundant permute pairs that > we now detect and eliminate. > > With the patch, for the example loop > for (int i = N - 1; i >= 0; i--) > { > a[i] = b[i] + 1.0f; > } > Changes to the following > - ldr q29, [x0, x2] > - tbl v29.16b, {v29.16b}, v31.16b > - fadd v29.4s, v29.4s, v30.4s > - tbl v29.16b, {v29.16b}, v31.16b > - str q29, [x3, x2] > + ldr q30, [x0, x2] > + fadd v30.4s, v30.4s, v31.4s > + str q30, [x3, x2]
So this works basically as a post-processing optimization at the time we generate the vector store. While that's in principle an OK optimization I'd rather have such post-processing implemented outside of the vectorizer because then also permutes not originating from vectorizer permuted store generation would benefit. As for implementing this in the vectorizer itself the more appropriate thing would be to expose these permutes to the permute optimization phase, because then it can be also taken into account during costing and a reverse load permute could be elided if it feeds an associatable reduction. There is, unfortunately, currently no good way to represent how we implement negative strided contiguous accesses with load permutations as the peculiarity only exposes itself after applying the VF and load/lane permutations are represented on the VF == 1 SLP graph. One of my ideas what that once we settle on VF (and possibly vector types) we want to expand the SLP graph to cover all lanes of the vector loop so we can expose actual permutes and vector granularity. This is a bit far off though. So in line with your patch but more appropriate for in-vectorizer operation would be an analysis on the SLP graph that simply marks reverse permutes that can be elided (for the back-to-back case). This way both costing and code generation can take this into account and you wouldn't have to adjust any stmts. Thanks, Richard. > PR tree-optimization/61338 > > gcc/ChangeLog: > (get_vector_perm_operand): New. > (vect_find_reverse_permute_operand): New helper function > to find reverse permutations through element-wise operation chains. > Returns true only if ALL operands have reverse permutations. > (vectorizable_store): Use recursive helper to eliminate redundant > reverse permutations with configurable search depth. > > gcc/testsuite/ChangeLog: > > * gcc.dg/vect/slp-permute-reverse-1.c: New test for basic > reverse permute optimization (simple copy). > * gcc.dg/vect/slp-permute-reverse-2.c: New runtime test for > basic pattern. > Signed-off-by: Kugan Vivekanandarajah <[email protected]> > > Bootstrapped and regression tested on aarch64-linux-gcc. Is this OK? > > Thanks, > Kugan > >
