Re: [PATCH] tree-optimization/120687 - permute optimization wrt unsupported layout 0

Richard Biener Fri, 24 Oct 2025 02:21:44 -0700

On Thu, 23 Oct 2025, Richard Sandiford wrote:

> Richard Biener <[email protected]> writes:
> > On Wed, 22 Oct 2025, Richard Sandiford wrote:
> >
> >> Richard Biener <[email protected]> writes:
> >> > The following tries to addess one shortcoming of the SLP permute
> >> > optimization phase which is that it tries to refuse layouts that
> >> > cannot revert to layout 0 on an edge even if layout 0 isn't actually
> >> > valid.  This hurts in particular on VLA vector targets where many
> >> > permutes cannot be code generated.  While for fixed-length targets
> >> > costs eventually sort things out in a way to chose a "no permutation"
> >> > setting the above check prevents VLA targets from arriving there.
> >> 
> >> I'm not sure this is the right way to go.  The subpass is supposed
> >> to be an optimisation pass rather than a legitimisation/"make something
> >> vectorisable" pass.  The cost function is purely aimed at reducing
> >> permutations rather than at turning unvectorisable graphs into
> >> vectorisation ones.  If the pass happens to make something vectorisable,
> >> that happens by chance rather than by design.
> >
> > Note while the change turns something not vectorizable into something
> > vectorizable it does so by enabling permute optimization to see that
> > a load permutation can be fully elided.  This currently does not work
> > because of the imposed "the reverse permute has to be code generatable"
> > check(s).
> 
> Yeah.  But the purpose of the pass is to optimise permutations in
> cases where the original form is vectorisable, rather than to search
> for a vectorisable form.
> 
> > The existing handling of [in_layout_i == 0 &&] out_layout_i == 0 for
> > a not code generatable permute made me think that the fact that
> > layout 0 isn't code generatable shouldn't prevent us from arriving
> > at the optimal (just from a cost perspective) layout configuration.
> >
> > To this extent, do you agree with the proposed change to
> > internal_node_cost to anticipate redundant permute eliding?
> 
> As far as that change goes, I think it would more convincing to have
> a motivating case in which the pass wasn't being used to turn
> non-vectorisable code into vectorisable code.
> 
> > One could argue that vect_transform_slp_perm_load_1 should
> > not fail for those - in this case it fails for
> > group_size == dr_group_size == 16 but nunits [4,4], so
> > a !repeating_p permute.
> >
> >> I can see that a similar type of pass might be useful for pushing
> >> permutations around until the graph is vectorisable.  But I think
> >> that should be a separate subpass that runs first, with a different goal
> >> function.  Perhaps it would share enough code with the optimisation pass
> >> that they could both be subclasses of a common base class (not sure).
> >
> > I'll note that at least for permutes on loads we're in the difficult
> > situation that permute lowering only handles 
> > interleaving patterns right now and other permutes are 
> > not lowered but left to vectorizable_load.  Trying to legitimize
> > those earlier (anticipating what vectorizable_load would do) would
> > be nice but also necessarily dependent on VF (permute lowering also
> > looks at VF and at a point where it's not yet final ...)
> >
> > But as you say, legitimizing should be separate from optimizing but
> > the issue at hand is that optimizing on a not legitimized graph
> > ties its hands because of the reverse permute requirement ...
> 
> Yeah, but like I say, I think the legitimising should happen first.
> The hand-tying wouldn't matter then: the optimisation pass would start
> with a vectorisable graph (as far as permutes go) and would have an
> opportunity to improve it.
> 
> I think the general legitimisation problem isn't just about eliding.
> It's about pushing permutations around until a new (possibly non-empty)
> set of permutations work.  And since it's a question of legitimisation,
> we should do that even if it involves "pessimising" hotter blocks to make
> colder blocks work, which is something that the optimisation pass tries
> to avoid (when optimising for speed).
> 
> >> > The patch fixes up internal_node_cost to anticipate redundant permute
> >> > eliding and short-cuts the above requirement on edges to partitions
> >> > that do not have a prefered layout with the idea we shouldn't have
> >> > the need to revert to layout zero for those.
> >> >
> >> > The downside is that we now can arrive at invalid layout configurations
> >> > which we deal with simply giving up.
> >> 
> >> Right.  I think that for other testcases that are currently successfully
> >> optimised, and that are vectorisable with and without the optimisation,
> >> we might lose optimisations by doing this, because the pass could get
> >> stuck in a dead end.  The risk is probably higher for VLA testcases,
> >> due to the same property that motivates this patch.
> >> 
> >> The change therefore feels like a bit of a hack to me.
> >
> > It probably is, esp. singling out that -1 layout.  The question is
> > what the alternative is?
> 
> Running a new legitimisation pass just before the optimisation pass
> seems like the way to go to me, but I realise it's not a quick fix.


OK.  I think it should be possible to do "limited" legitimisation for
specific cases like this pretty easily as a pre-pass.  We could do
this within the same framework, re-using graph building, initial
layout computation and materialization / redundant permutation eliding.
Adding just a legitimise_pass () into the current pipeline looks
appealing to me, but I'm not sure how much wreckage we'd eventually
cause (for this case we're just after eliding a redundant permutation).
So better do a full separate pass and not try sharing graph/partition
and layout info?  

Richard.

> >
> > What's the purpose of rejecting the layout at
> >
> >                     /* Reject the layout if it would make layout 0 
> > impossible
> >                        for later partitions.  This amounts to testing that 
> > the
> >                        target supports reversing the layout change on 
> > edges
> >                        to later partitions.
> >           
> >                        In principle, it might be possible to push a layout
> >                        change all the way down a graph, so that it never
> >                        needs to be reversed and so that the target doesn't
> >                        need to support the reverse operation.  But it 
> > would
> >                        be awkward to bail out if we hit a partition that
> >                        does not support the new layout, especially since
> >                        we are not dealing with a lattice.  */
> >                     is_possible &= edge_layout_cost (ud, other_node_i, 0,
> >                                                      layout_i).is_possible 
> > ();
> >
> > ?  It seems we want to enable the backward pass to undo everything the
> > forward pass did, because that might have gone "too far"?  But we
> > do not check that we can "undo", aka go back to the initial chosen
> > layout (which is not always 0 but matches the reverse load permutation?)
> > My alternative idea was to not require the ability to go back if
> > layout 0 is "impossible" in terms of internal_node_cost?
> 
> The assumption is that layout 0 is the conservative choice.  Of course,
> that isn't true if the original layout isn't vectorisable, but like
> I say, it's not IMO the pass's job to handle that case.
> 
> Thanks,
> Richard
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH] tree-optimization/120687 - permute optimization wrt unsupported layout 0

Reply via email to