On Wed, 6 May 2026, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: 15 May 2025 07:48
> > To: Tamar Christina <[email protected]>
> > Cc: Richard Sandiford <[email protected]>; gcc-
> > [email protected]
> > Subject: RE: [PATCH][RFC] Add vector_costs::add_vector_cost vector stmt
> > grouping hook
> >
> > On Wed, 14 May 2025, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Tuesday, May 13, 2025 12:08 PM
> > > > To: Richard Sandiford <[email protected]>
> > > > Cc: [email protected]; Tamar Christina <[email protected]>
> > > > Subject: Re: [PATCH][RFC] Add vector_costs::add_vector_cost vector stmt
> > > > grouping hook
> > > >
> > > > On Tue, 13 May 2025, Richard Sandiford wrote:
> > > >
> > > > > Richard Biener <[email protected]> writes:
> > > > > > The following refactors the vectorizer vector_costs target API
> > > > > > to add a new vector_costs::add_vector_cost entry which groups
> > > > > > all individual sub-stmts we create per "vector stmt", aka SLP
> > > > > > node.  This allows for the targets to more easily match on
> > > > > > complex cases like emulated gather/scatter or even just vector
> > > > > > construction.
> > > > > >
> > > > > > The patch itself is just a prototype and leaves out BB vectorization
> > > > > > for simplicity.  It also does not fully group all vector stmts
> > > > > > but leaves some bare add_stmt_cost hook invocations.  I'd expect
> > > > > > the add_stmt_cost hook to still be used for scalar stmt costing and
> > > > > > for costing added branching around prologue/epilogue.  The
> > > > > > default implementation of add_vector_cost just dispatches to
> > > > > > add_stmt_cost for individual stmts.
> > > > > > Eventually the actual data
> > > > > > we track for the combined costing will diverge (no need to track
> > > > > > SLP node or stmt_info there?), so targets would eventually be
> > > > > > expected to implement both hooks and splice out common workers
> > > > > > to deal with "missing" information coming in from the different
> > > > > > entries.
> > > > > >
> > > > > > This should eventually baby-step us towards the generic vectorizer
> > > > > > code being able to compute and compare latency and resource
> > > > > > utilization throughout the scalar / vector loop iteration based
> > > > > > on latency and throughput data determined on a stmt-by-stmt basis
> > > > > > from the target.  As given the grouping should be an incremental
> > > > > > improvement, but I have not tried to see how it can simplify
> > > > > > the x86 hook implementation - I've been triggered by the aarch64
> > > > > > reported bootstrap fail on the cleanup RFC I posted given that
> > > > > > code wants to identify a scalar load that's costed as part of
> > > > > > a gather/scatter operation.
> > > > > >
> > > > > > Any comments or problems you foresee?
> > > > >
> > > > > Could the stmt_vector_for_cost pointer instead be passed to
> > > > > TARGET_VECTORIZE_CREATE_COSTS?  The danger with passing it to
> > > > > add_vector_cost is that the same vector_costs instance might get used
> > > > > for multiple different costing attempts, so that only the provided
> > > > > stmt_vector_for_costs are specific to the current costing attempt.
> > > > > But for complex cases, the target's vector_costs should be able
> > > > > to cache its own target-specific information, with the same
> > > > > lifetime/scope as the stmt_vector_for_costs.
> > > >
> > > > It cannot be passed to TARGET_VECTORIZE_CREATE_COSTS - but I can
> > > > not pass it at all, in the proposed implementation it is
> > > > actually node->cost_vec.  It's the set of stmts we cost for
> > > > a single SLP node.
> > > > I'm not sure the "group" is what targets
> > > > would cache, they'd rather cache whatever they make from the
> > > > group and its contents?
> > > >
> > > > That said, the most aggressive way of handling it would be
> > > > to defer everything to the target and just pass in the
> > > > set of SLP instances to TARGET_VECTORIZE_CREATE_COSTS and
> > > > not perform any individual add_stmt_cost calls at all, but expect
> > > > the target to walk the SLP graph at finish_cost () time.
> > > >
> > >
> > > I was actually wondering whether it wouldn't be indeed better to cost
> > > the slp_instances as those contain roots that would need to be costed
> > > too.
> >
> > Yes, I need to think about that.  But it's also that in practice
> > BB vectorization costing will work quite differently from loop
> > costing since for BB vectorization there's no implicit unrolling
> > and you have to think about surrounding stmts.
> >
> > > For early break if we're costing purely based on SLP node then the
> > > actual break itself can't be costed as it's not in the node.  We'd need
> > > this to be able to re-order the exits during slp
> > > scheduling based on their actual cost.
> >
> > Note the proposed prototype patch still gets you add_stmt_cost
> > hook calls for the non-SLP stmts, it's just an easy way to
> > let the target know that costed sub-stmts belong to the same
> > SLP tree.
> >
> > I'll put this on the side for now.
>
> I've hit a few cases now where such a change would have been useful.
> The one I've most recently hit was costing of LOAD_LANES with gaps
> and gather/scatter addressing.
>
> I think this patch was a step in the right direction, at least it would enable
> targets not to try to "match up" individual costing calls back to a group.

Implementation-wise it might be cheaper to only adjust the target
interfacing by making the additional costing hook receive not a full
vec<> but an array_slice<>, given the cost_vector already has the
entries "sorted" by slp_node.

Looking over the patch and how it currently addresses only loop
vectorization and seeing how to easily generalize, the separate
memory management of a cost vector per SLP node introduces
complication (at this point) without a clear benefit.  For BB
costing we have to adjust the sorting by loops to a stable sort,
but otherwise adjusting just add_stmt_costs to block should work.

I'll send another prototype.

Richard.

> Thanks,
> Tamar
>
> >
> > Richard.
> >
> > >
> > > Cheers,
> > > Tamar
> > >
> > > > The x86 target currently keeps counters of certain ops but
> > > > does not cache the full-blown stmts from add_stmt_cost for
> > > > computing the overall cost at finish_cost.  I'll have to look
> > > > what aarch64 does here.
> > > >
> > > > Ultimately I'd like to take into account stmt dependences
> > > > during costing - at the moment we are asking the target to
> > > > compute per stmt "latencies" but then we just sum those.
> > > > One improvement would be to compute the max latency through
> > > > the graph and the maximum width (without having throughput
> > > > or port assignments and an actual scheduler implementation).
> > > >
> > > > Richard.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Richard
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

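For illustration only, here is a rough standalone sketch of the slice
idea discussed above: keep one flat cost vector, stable-sort it by
owning node, and hand each contiguous run to a group hook whose default
just dispatches to the per-stmt hook.  This is not the patch and not
GCC code.  cost_entry, cost_slice, toy_costs, sort_by_node and
cost_by_node are all made-up names; cost_slice is only a stand-in for
array_slice<>, and the integer node_id stands in for the slp_node
pointer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one costed sub-stmt (not stmt_info_for_cost).
struct cost_entry {
  int node_id;    // assumed id of the owning SLP node, -1 for non-SLP stmts
  int count;
  int unit_cost;
};

// Hypothetical non-owning view over a run of entries (stand-in for array_slice<>).
struct cost_slice {
  const cost_entry *first;
  std::size_t length;
  const cost_entry *begin() const { return first; }
  const cost_entry *end() const { return first + length; }
};

// Hypothetical stand-in for a target's vector_costs class.
struct toy_costs {
  int total = 0;
  int groups_seen = 0;

  // Per-stmt hook: just accumulate count * unit cost.
  virtual void add_stmt_cost(const cost_entry &e) { total += e.count * e.unit_cost; }

  // Group hook: the default dispatches to the per-stmt hook; a target
  // could override it to match on the group as a whole.
  virtual void add_vector_cost(cost_slice group) {
    ++groups_seen;
    for (const cost_entry &e : group)
      add_stmt_cost(e);
  }

  virtual ~toy_costs() = default;
};

// Stable sort keeps the original order of entries within one node.
inline void sort_by_node(std::vector<cost_entry> &cost_vec) {
  std::stable_sort(cost_vec.begin(), cost_vec.end(),
                   [](const cost_entry &a, const cost_entry &b) {
                     return a.node_id < b.node_id;
                   });
}

// Walk the sorted vector; each contiguous run with the same node id goes
// to the group hook, entries without a node go to the per-stmt hook.
inline void cost_by_node(toy_costs &costs, const std::vector<cost_entry> &cost_vec) {
  std::size_t begin = 0;
  while (begin < cost_vec.size()) {
    if (cost_vec[begin].node_id < 0) {
      costs.add_stmt_cost(cost_vec[begin]);
      ++begin;
      continue;
    }
    std::size_t end = begin + 1;
    while (end < cost_vec.size() && cost_vec[end].node_id == cost_vec[begin].node_id)
      ++end;
    costs.add_vector_cost(cost_slice{cost_vec.data() + begin, end - begin});
    begin = end;
  }
}
```

The slice is only a view into the flat vector, so nothing has to be
allocated or owned per node, and the per-stmt path for entries without
a node stays as it is.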