On Thu, Sep 18, 2025 at 10:19 PM Robin Dapp <rdapp....@gmail.com> wrote:
>
> > But the vector type we perform the permutation on should be unchanged (it's
> > not the punned type but the original type we pun the loaded vector back to)?
>
> Yeah, I was trying to re-use what we have but I see now that just passing a
> different vectype to vect_transform_slp_perm_load doesn't work in all cases.
>
> But apart from that I cannot think of a good or canonical way of achieving the
> "filtering" I want.  The high-level picture is that every node only accesses a
> contiguous part of the group which is represented in the load perm.
>
> I guess a more orthodox way would be to try to pun the whole group (of size 8
> here) with a vector element instead of just the number of SLP lanes.  Right now
> it just fits "by accident".  Then introduce load-permutation handling for the
> result.  That would also involve adjusting ncopies like in the VMAT_STRIDED_SLP
> case (thus making gather/scatter more similar to VMAT_STRIDED_SLP) but is
> eventually doable.
>
> In the end we'd have 2x the number of loads with larger element size in my
> example that would be needed to permute into place.  Even with that we'd arrive
> at a point where we would want to recognize that only half, quarter, etc. of a
> group is actually used in a node and adjust the pun element-size accordingly.
>
> So I'm not sure there is a way of recognizing this from just the group or the
> gap or another property.  If there is I would be glad to use it but all I can
> come up with is actually inspecting the load permutation per node.  When it is
> monotonic/contiguous we can pun more efficiently so to say.  Otherwise we need
> to "capture" the whole group with a punned element.

The load permutation works with the idea that we have a contiguous
stream of whole-DR-group lanes.  How we end up with that is an implementation
detail - so when we use a strided load with punned elements this still fits
when the result contains the whole group (including a possible gap at the end,
when !STMT_VINFO_STRIDED_P).  IIRC your patches did not attempt to
change the result of the load (that would be invalid), so the easiest way might
be to simply apply the load permute transform at the end (and make sure we
can perform the permute, of course).
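
As a rough illustration of that model (a standalone sketch, not GCC code;
apply_load_perm and its parameters are hypothetical names and this is not
how vect_transform_slp_perm_load is implemented), the load permutation can
be thought of as indexing into a stream of whole groups, regardless of how
that stream was produced:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: STREAM holds N_GROUPS whole groups of GROUP_SIZE
   lanes each, back to back (a trailing gap counts as lanes).  LOAD_PERM
   names, per SLP lane, which lane of the group the node wants.  */
void
apply_load_perm (const int *stream, size_t group_size, size_t n_groups,
                 const unsigned *load_perm, size_t n_lanes, int *out)
{
  assert (group_size > 0);
  for (size_t g = 0; g < n_groups; ++g)
    for (size_t l = 0; l < n_lanes; ++l)
      {
        /* Every permutation index must stay within one group.  */
        assert (load_perm[l] < group_size);
        out[g * n_lanes + l] = stream[g * group_size + load_perm[l]];
      }
}
```

With a group of size 8, a stream holding the values 0..15 and a node whose
load permutation is { 4, 5, 6, 7 }, this yields 4 5 6 7 12 13 14 15.  The
point is only that the permute sees the same lanes whether the stream came
from contiguous loads or from strided loads with punned elements.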

The missed optimization (like with the VMAT_ELEMENTWISE case) is then
only that we could possibly implement the permutation by changing the
order or size of the loads themselves (for example not load a gap if the
only thing the permute is doing is to get rid of it).
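
A minimal sketch of the kind of per-node check that would enable this
(again hypothetical; load_perm_contiguous_p is an illustrative name, not
an existing GCC function): when the load permutation of a node is a run of
consecutive lanes, the loads themselves could in principle cover just that
run instead of the whole group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if LOAD_PERM selects N_LANES
   consecutive lanes of the group, storing the first lane in *FIRST.  */
bool
load_perm_contiguous_p (const unsigned *load_perm, size_t n_lanes,
                        unsigned *first)
{
  assert (n_lanes > 0);
  for (size_t i = 1; i < n_lanes; ++i)
    if (load_perm[i] != load_perm[0] + i)
      return false;
  *first = load_perm[0];
  return true;
}
```

For { 4, 5, 6, 7 } this reports a contiguous run starting at lane 4; for
something like { 1, 0, 3, 2 } it does not, and the whole group would have to
be loaded and permuted as described above.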

Richard.

>
> --
> Regards
>  Robin
>
