On Tue, Sep 16, 2025 at 3:07 PM Robin Dapp <rdapp....@gmail.com> wrote:
>
> > I think this now conflicts a bit with what I just pushed (sorry).
> >
> >>        && loop_vinfo)
> >>      {
> >> +      unsigned i, j;
> >> +      bool simple_perm_series = true;
> >> +      FOR_EACH_VEC_ELT (SLP_TREE_LOAD_PERMUTATION (slp_node), i, j)
> >> +       if (i != j)
> >> +         simple_perm_series = false;
> >
> > In particular I disallow all load permutes, since we are going to elide it.
>
> The load permutation in the x264 case is simple, {0, 1, 2, 3}, {4, 5, 6, 7},
> etc.  If we're disallowing these as well here there is no point in having the
> new function ;)
>
> So we're not allowing load permutations here because we would elide (=not
> generate code for) them via
> >   /* ???  The following checks should really be part of
> >      get_load_store_type.  */
> >   if (SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
> even though perm_ok = true?

Yes.

> Isn't this all single-element, single-lane anyway?

Well, what you want to catch now isn't single-lane anymore.  But I guess since
we now check the permute before this we can rely on check for n_perms == 0
to catch the "no actual permutation required" case?

> Can I re-allow simple permutations (as they are nops anyway)?  Or rather add a
> vect_transform_slp_perm_load in the mat_gather_scatter_p part of
> vectorizable_load?

So can you pass down n_perms and check that instead of your loop over
the permutation?

> The case I was trying to avoid (for now) when disabling load permutations
> is
>
>   for (int i = 0; i < 512; ++i)
>     {
>       x[2*i] = y[1023 - (2*i)];
>       x[2*i+1] = y[1023 - (2*i+1)];
>     }
>
> (load permutation {1, 0}).

Yeah, given we do not apply permutations after gathering the vector (because,
with gathers, there's never any load permutation ... until now).

> > Re-doing this after the above is a bit ugly.  Likewise having pun_vectype.
> > It's also an opportunity to move the VMAT_STRIDED_SLP punning and
> > alignment (re-)computation code here.  OTOH this is complicated enough
> > already.
>
> I can try but I'm not sure it will be any less complex.
>
> > I believe you now have to change
> >
> >   /* ???  The following checks should really be part of
> >      get_load_store_type.  */
> >   if (SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
> >       && !((memory_access_type == VMAT_ELEMENTWISE
> >             || mat_gather_scatter_p (memory_access_type))
> >            && SLP_TREE_LANES (slp_node) == 1
> >            && (!grouped_load
> >                || !DR_GROUP_NEXT_ELEMENT (first_stmt_info))))
> >     {
> >
> > as otherwise you'll get slp_perm set but the permute will be silently
> > elided.  This is all a bit ugly already :/
>
> So we're setting slp_perm but know that we don't act on it in 
> VMAT_ELEMENTWISE?

So for VMAT_ELEMENTWISE we _do_ apply a permutation (not sure that works
though, too many guard rails around VMAT_ELEMENTWISE).  But when we do
not run into the above body we have slp_perm = false and skip applying a
permute.  The above tries to catch the old strided load case and the
VMAT_ELEMENTWISE
we fall back to when we cannot materialize the permute.

I don't like very much how this is spread over multiple places ...

Ideally VMAT_ELEMENTWISE would support all permutes by means of
directly honoring it during vector construction, so it shouldn't matter whether
vect_transform_slp_perm_load would succeed ot fail.  But that's not done yet.
I think I'd want to split VMAT_ELEMENTWISE from VMAT_STRIDED_SLP for
this, but until I'll try I don't know whether I'll like that.

Given we now have gathers that can have a permutation - like the
following:

> >
> > Now I wonder if/how we handle
> >
> >  for (i = 0; i < n; ++i)
> >    {
> >       int j = offset[i];
> >       sum += data[2*i] + data[2*i+1];
> >    }

could be such one, it probably makes sense to handle load-permutation
materialization after the gather operation.  OTOH I'd like to get rid
of load-permutations, but ...

> > aka a gather grouped load.  Maybe there's an opportunity to handle
> > the "punning" higher up, representing this as a single-element group
> > in the first place.  Hmm.
> >
> > Anyway, I think the general direction of the patch is OK.  You'll have to
> > figure what I just broke though and I'm still somewhat missing the
> > "beauty" moment when thinking of how the VMAT_STRIDED_SLP
> >
> >       /* ???  Modify local copies of alignment_support_scheme and
> >          misalignment, but this part of analysis should be done
> >          earlier and remembered, likewise the chosen load mode.  */
> >
> > parts resolve themselves into happiness with this ... ;)
>
> I'm not sure I'm capable of producing a beauty moment in the vectorizer ;)

Heh, I wasn't expecting you to produce beauty - it's not that I don't fail there
as well :/

> but I'll try how ugly it gets when consolidating the type punning.

Yep, I was just thinking that as it's similar (and we've come from doing
the strided load in that code path to the gather one), we might put up
something re-usable.

Richard.

>
> --
> Regards
>  Robin
>

Reply via email to