RE: [PATCH][RFC] Add vector_costs::add_vector_cost vector stmt grouping hook

Richard Biener Wed, 06 May 2026 04:04:34 -0700

On Wed, 6 May 2026, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: 15 May 2025 07:48
> > To: Tamar Christina <[email protected]>
> > Cc: Richard Sandiford <[email protected]>; gcc-
> > [email protected]
> > Subject: RE: [PATCH][RFC] Add vector_costs::add_vector_cost vector stmt
> > grouping hook
> > 
> > On Wed, 14 May 2025, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Tuesday, May 13, 2025 12:08 PM
> > > > To: Richard Sandiford <[email protected]>
> > > > Cc: [email protected]; Tamar Christina
> > <[email protected]>
> > > > Subject: Re: [PATCH][RFC] Add vector_costs::add_vector_cost vector stmt
> > > > grouping hook
> > > >
> > > > On Tue, 13 May 2025, Richard Sandiford wrote:
> > > >
> > > > > Richard Biener <[email protected]> writes:
> > > > > > The following refactors the vectorizer vector_costs target API
> > > > > > to add a new vector_costs::add_vector_cost entry which groups
> > > > > > all individual sub-stmts we create per "vector stmt", aka SLP
> > > > > > node.  This allows for the targets to more easily match on
> > > > > > complex cases like emulated gather/scatter or even just vector
> > > > > > construction.
> > > > > >
> > > > > > The patch itself is just a prototype and leaves out BB vectorization
> > > > > > for simplicity.  It also does not fully group all vector stmts
> > > > > > but leaves some bare add_stmt_cost hook invocations.  I'd expect
> > > > > > > the add_stmt_cost hook to be still used for scalar stmt costing and
> > > > > > for costing added branching around prologue/epilogue.  The
> > > > > > default implementation of add_vector_cost just dispatches to
> > > > > > add_stmt_cost for individual stmts.  Eventually the actual data
> > > > > > we track for the combined costing will diverge (no need to track
> > > > > > SLP node or stmt_info there?), so targets would eventually be
> > > > > > expected to implement both hooks and splice out common workers
> > > > > > to deal with "missing" information coming in from the different
> > > > > > entries.
> > > > > >
> > > > > > This should eventually baby-step us towards the generic vectorizer
> > > > > > code being able to compute and compare latency and resource
> > > > > > utilization throughout the scalar / vector loop iteration based
> > > > > > > on latency and throughput data determined on a stmt-by-stmt basis
> > > > > > from the target.  As given the grouping should be an incremental
> > > > > > improvement, but I have not tried to see how it can simplify
> > > > > > the x86 hook implementation - I've been triggered by the aarch64
> > > > > > reported bootstrap fail on the cleanup RFC I posted given that
> > > > > > code wants to identify a scalar load that's costed as part of
> > > > > > a gather/scatter operation.
> > > > > >
> > > > > > > Any comments or problems you foresee?
> > > > >
> > > > > Could the stmt_vector_for_cost pointer instead be passed to
> > > > > TARGET_VECTORIZE_CREATE_COSTS?  The danger with passing it to
> > > > > add_vector_cost is that the same vector_costs instance might get used
> > > > > for multiple different costing attempts, so that only the provided
> > > > > stmt_vector_for_costs are specific to the current costing attempt.
> > > > > But for complex cases, the target's vector_costs should be able
> > > > > to cache its own target-specific information, with the same
> > > > > lifetime/scope as the stmt_vector_for_costs.
> > > >
> > > > It cannot be passed to TARGET_VECTORIZE_CREATE_COSTS - but I could
> > > > avoid passing it at all; in the proposed implementation it is
> > > > actually node->cost_vec.  It's the set of stmts we cost for
> > > > a single SLP node.  I'm not sure the "group" is what targets
> > > > would cache, they'd rather cache whatever they make from the
> > > > group and its contents?
> > > >
> > > > That said, the most aggressive way of handling it would be
> > > > to defer everything to the target and just pass in the
> > > > set of SLP instances to TARGET_VECTORIZE_CREATE_COSTS and
> > > > not perform any individual add_stmt_cost calls at all, but expect
> > > > the target to walk the SLP graph at finish_cost () time.
> > > >
> > >
> > > I was actually wondering whether it wouldn't be indeed better to cost
> > > the slp_instances as those contain roots that would need to be costed
> > > too.
> > 
> > Yes, I need to think about that.  But it's also that in practice
> > BB vectorization costing will work quite differently from loop
> > costing since for BB vectorization there's no implicit unrolling
> > and you have to think about surrounding stmts.
> > 
> > > For early break if we're costing purely based on SLP node then the
> > > actual break itself can't be costed as it's not in the node.  We'd need
> > > to be able to cost it in order to re-order the exits during SLP
> > > scheduling based on their actual cost.
> > 
> > Note the proposed prototype patch still gets you add_stmt_cost
> > hook calls for the non-SLP stmts, it's just an easy way to
> > let the target know that costed sub-stmts belong to the same
> > SLP tree.
> > 
> > I'll put this on the side for now.
> 
> I've hit a few cases now where such a change would have been useful.
> The one I've most recently hit was costing of LOAD_LANES with gaps
> and gather/scatter addressing.
> 
> I think this patch was a step in the right direction, at least it would enable
> targets not to try to "match up" individual costing calls back to a group.


Implementation-wise it might be cheaper to only adjust the target
interfacing by making the additional costing hook receive not
a full vec<> but an array_slice<> given the cost_vector already has
the entries "sorted" by slp_node.  Looking over the patch and how
it currently addresses only loop vectorization and seeing how to
easily generalize, the separate memory management of a cost vector
per SLP node introduces complication (at this point) without a clear
benefit.  For BB costing we have to adjust the sorting by loops
to a stable sort, but otherwise adjusting just add_stmt_costs
to block should work.
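
To illustrate the slice idea, here is a minimal standalone sketch, not
the actual GCC code.  It assumes the cost vector is already sorted by
node; cost_entry, cost_slice, group_by_node and group_cost are
hypothetical names, and an integer node_id stands in for the SLP node
pointer.  Each group is a non-owning view into the one shared vector,
so no per-node cost vector has to be allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one entry of the vectorizer's cost vector.
struct cost_entry
{
  int node_id;    // stand-in for the SLP node the entry belongs to
  int count;      // number of sub-stmts of this kind
  int unit_cost;  // stand-in for the per-stmt cost
};

// Non-owning view of a contiguous run of entries, similar in spirit
// to an array_slice<>.
struct cost_slice
{
  const cost_entry *first;
  std::size_t size;
  const cost_entry *begin () const { return first; }
  const cost_entry *end () const { return first + size; }
};

// Split COSTS into slices of consecutive entries sharing a node_id.
// Assumes entries of the same node are adjacent (vector sorted by node).
std::vector<cost_slice>
group_by_node (const std::vector<cost_entry> &costs)
{
  std::vector<cost_slice> groups;
  std::size_t start = 0;
  for (std::size_t i = 1; i <= costs.size (); ++i)
    if (i == costs.size () || costs[i].node_id != costs[start].node_id)
      {
        groups.push_back ({costs.data () + start, i - start});
        start = i;
      }
  return groups;
}

// Hypothetical group hook: sees all sub-stmts of one node at once.
// Here it only sums them, as a default dispatching to per-stmt
// costing would.
int
group_cost (cost_slice group)
{
  assert (group.size > 0);
  int total = 0;
  for (const cost_entry &e : group)
    total += e.count * e.unit_cost;
  return total;
}
```

A target-specific version of group_cost could inspect the whole slice
to recognize a pattern (say, the scalar loads of an emulated gather)
instead of matching up individual calls.  If the entries are brought
into order by sorting, that sort would need to be stable to keep the
per-node entry order intact.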

I'll send another prototype.

Richard.

> Thanks,
> Tamar
> 
> > 
> > Richard.
> > 
> > >
> > > Cheers,
> > > Tamar
> > >
> > > > The x86 target currently keeps counters of certain ops but
> > > > does not cache the full-blown stmts from add_stmt_cost for
> > > > computing the overall cost at finish_cost.  I'll have to look
> > > > what aarch64 does here.
> > > >
> > > > Ultimately I'd like to take into account stmt dependences
> > > > during costing - at the moment we are asking the target to
> > > > compute per stmt "latencies" but then we just sum those.
> > > > One improvement would be to compute the max latency through
> > > > the graph and the maximum width (without having throughput
> > > > or port assignments and an actual scheduler implementation).
> > > >
> > > > Richard.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Richard
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
> > >
> > 
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
