On Wed, Oct 15, 2025 at 12:08 AM Kugan Vivekanandarajah
<[email protected]> wrote:
>
> Hi,
>
> This patch eliminates redundant reverse permutations in vectorized reverse
> loops by detecting and optimizing patterns during store vectorization.
>
> The reverse load (b[i]) generates PERM, operations are applied, then the
> reverse store adds another PERM. This creates redundant permute pairs that
> we now detect and eliminate.
>
> With the patch, for the example loop
>   for (int i = N - 1; i >= 0; i--)
>     {
>       a[i] = b[i] + 1.0f;
>     }
> Changes to the following
> -       ldr     q29, [x0, x2]
> -       tbl     v29.16b, {v29.16b}, v31.16b
> -       fadd    v29.4s, v29.4s, v30.4s
> -       tbl     v29.16b, {v29.16b}, v31.16b
> -       str     q29, [x3, x2]
> +       ldr     q30, [x0, x2]
> +       fadd    v30.4s, v30.4s, v31.4s
> +       str     q30, [x3, x2]

So this works basically as a post-processing optimization at the time
we generate the
vector store.  While that's in principle an OK optimization I'd rather
have such post-processing
implemented outside of the vectorizer because then also permutes not
originating from
vectorizer permuted store generation would benefit.

As for implementing this in the vectorizer itself the more appropriate
thing would be
to expose these permutes to the permute optimization phase, because then it can
be also taken into account during costing and a reverse load permute
could be elided
if it feeds an associatable reduction.

There is, unfortunately, currently no good way to represent how we implement
negative strided contiguous accesses with load permutations as the peculiarity
only exposes itself after applying the VF and load/lane permutations are
represented on the VF == 1 SLP graph.  One of my ideas what that once we
settle on VF (and possibly vector types) we want to expand the SLP graph
to cover all lanes of the vector loop so we can expose actual permutes and
vector granularity.  This is a bit far off though.

So in line with your patch but more appropriate for in-vectorizer
operation would
be an analysis on the SLP graph that simply marks reverse permutes that can
be elided (for the back-to-back case).  This way both costing and code
generation
can take this into account and you wouldn't have to adjust any stmts.

Thanks,
Richard.

>         PR tree-optimization/61338
>
> gcc/ChangeLog:
>         (get_vector_perm_operand): New.
>         (vect_find_reverse_permute_operand): New  helper function
>         to find reverse permutations through element-wise operation chains.
>         Returns true only if ALL operands have reverse permutations.
>         (vectorizable_store): Use recursive helper to eliminate redundant
>         reverse permutations with configurable search depth.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/slp-permute-reverse-1.c: New test for basic
>         reverse permute optimization (simple copy).
>         * gcc.dg/vect/slp-permute-reverse-2.c: New runtime test for
>         basic pattern.
> Signed-off-by: Kugan Vivekanandarajah <[email protected]>
>
> Bootstrapped and regression tested on aarch64-linux-gcc. Is this OK?
>
> Thanks,
> Kugan
>
>

Reply via email to