On Tue, May 12, 2026 at 1:32 PM Richard Sandiford <[email protected]> wrote: > > Tamar Christina <[email protected]> writes: > >> -----Original Message----- > >> From: Richard Sandiford <[email protected]> > >> Sent: 06 May 2026 23:12 > >> To: Tamar Christina <[email protected]> > >> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw > >> <[email protected]>; [email protected]; Alex Coplan > >> <[email protected]>; [email protected]; Wilco Dijkstra > >> <[email protected]>; Alice Carlotti <[email protected]> > >> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing > >> [PR124866] > >> > >> Tamar Christina <[email protected]> writes: > >> > The vectorizer when costing LOAD_LANES with gaps breaks the costing into > >> > multiple calls where the first call wants to cost GAPS loads and second > >> > GROUP_SIZE - GAPS loads. However: > >> > > >> > 1. Both VMAT types are LOAD_LANES, we we cost effectively > >> > GROUP_SIZE * LOAD_LANES. This increases the latency costs immensely. > >> > > >> > 2. The GROUP_SIZE costing also increases the throughput costs * > >> GROUP_SIZE, > >> > which has the knock on effect of increasing the costs for the > > latency again > >> > as the compiler applies a latency penalty due to the scalar code > > issuing > >> > quicker. > >> > > >> > What the vectorizer is trying to cost is simply 1x LOAD_LANES but > > it's asking > >> to > >> > cost the invidual gaps as well. However for an LDn operation this > > costs is > >> > always fixed and is the cost of the LDn itself. > >> > > >> > when there isn't any gaps we only cost 1x LOAD_LANES which is the > >> expected > >> > amount. Unfortunately we can't really tell in the backend easily > > what chain > >> of > >> > costing belongs to any one operation and so treat them as individual > >> > ones. > >> > > >> > This patch keeps track of the LOAD_LANES we've seen already when there is > >> a gap > >> > and > >> > > >> > 1. marks the cost of the first GAPS costing to 0. > >> > 2. costs the cost of the second GROUP_SIZE - GAPS loads as a single > >> LOAD_LANES. > >> > > >> > This is the best we can do until Richi's changes to cost operations > > like these > >> > as a group. > >> > > >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > >> > >> I think we should just remove the: > >> > >> /* An IFN_LOAD_LANES will load all its vector results, > >> regardless of which ones we actually need. Account > >> for the cost of unused results. */ > >> if (first_stmt_info == stmt_info) > >> { > >> unsigned int gaps = DR_GROUP_SIZE (first_stmt_info); > >> stmt_vec_info next_stmt_info = first_stmt_info; > >> do > >> { > >> gaps -= 1; > >> next_stmt_info = DR_GROUP_NEXT_ELEMENT > >> (next_stmt_info); > >> } > >> while (next_stmt_info); > >> if (gaps) > >> { > >> if (dump_enabled_p ()) > >> dump_printf_loc (MSG_NOTE, vect_location, > >> "vect_model_load_cost: %d " > >> "unused vectors.\n", > >> gaps); > >> vect_get_load_cost (vinfo, stmt_info, slp_node, gaps, > >> alignment_support_scheme, > >> misalignment, false, &inside_cost, > >> &prologue_cost, cost_vec, cost_vec, > >> true); > >> } > >> } > >> > >> that you pointed to in the PR. This predates the "all SLP" model. > >> It was originally added in r11-6662-ge45c41988bfd65, at a time when > >> vectorizable_load (and thus vect_model_load_cost) was called for each > >> individual *scalar* stmt. As a result, we generally costed one vector > >> load for each scalar load. One-off costs that applied to the whole > >> group were added only when costing the first stmt_info in the group. > >> > >> This meant that for load-lanes, we'd cost one vector_load for each > >> member of the LDn that was used. That was ok (in the sense of being > >> self-consistent) when all of the results were used. But it meant > >> that the cost of a fully-utilised LD4 was 4 times greater than > >> the cost of an LD4 in which only one result was used. That didn't > >> make any sense, since the instruction loaded the same amount of > >> data regardless. > >> > >> There's no need for that now that we only call vectorizable_load once > >> per SLP node, rather than once per scalar stmt, and now that we cost > >> the instructions in the same way that we generate them. > > > > Sure, but this patch intentionally avoided changing behavior for older > > cost models that was written with this behavior in mind. > > > > In particular almost all cost models without issue information have > > the extra permute costs for ldn as 0, including generic tuning. > > > > This is why the change avoided aarch64_tune_params.vec_costs->issue_info. > > > > Now I can fix generic tuning, but for every other core I'll have to add > > a very arbitrary value (probably the same as generic tuning). > > > > So If we're ok with that I can do that (and hopefully other targets > > don't object) > > But my point is that the existing AArch64 costs weren't written for > what the generic code currently does. The AArch64 costs were written > for the assumption that, in aggregate, N vector_loads would be costed > for each LDN, regardless of how many results are used. That's what used > to happen in GCC 11. > > This is why, for example: > > /* Per-vector cost of permuting vectors after an LD2, LD3 or LD4, > as well as the per-vector cost of permuting vectors before > an ST2, ST3 or ST4. */ > int ld2_st2_permute_cost; > int ld3_st3_permute_cost; > int ld4_st4_permute_cost; > > is per vector rather than per LDN. > > But if I'm following correctly, we now have the worst of both worlds. > This code: > > if (costing_p) > { > /* An IFN_LOAD_LANES will load all its vector results, > regardless of which ones we actually need. Account > for the cost of unused results. */ > if (first_stmt_info == stmt_info) > { > unsigned int gaps = DR_GROUP_SIZE (first_stmt_info); > stmt_vec_info next_stmt_info = first_stmt_info; > do > { > gaps -= 1; > next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info); > } > while (next_stmt_info); > if (gaps) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > "vect_model_load_cost: %d " > "unused vectors.\n", > gaps); > vect_get_load_cost (vinfo, stmt_info, slp_node, gaps, > alignment_support_scheme, > misalignment, false, &inside_cost, > &prologue_cost, cost_vec, cost_vec, > true); > } > } > n_adjacent_loads++; > continue; > } > > counts one load for the LDN itself plus one load for each unused result. > That doesn't make conceptual sense. The first_stmt_info == stmt_info > block is counting one vector_load *per vector*, but n_adjacent_loads++ > is counting one vector_load *per LDN*. In other words, we're mixing units. > > So in table form, the handling of LD4 seems to be: > > Number of used results GCC 11 Now > ---------------------- ------ ---- > 1 4 vector_loads 4 vector_loads > 2 4 vector_loads 3 vector_loads > 3 4 vector_loads 2 vector_loads > 4 4 vector_loads 1 vector_load > > If we remove the first_stmt_info == stmt_info block above, we'll get > one vector_load for all cases. AArch64 can then multiply that by > aarch64_ld234_st234_vectors to get the number of vectors. > > Alternatively, if we want generic code to present the same interface > as before, we could both remove the first_stmt_info == stmt_info block > and change: > > n_adjacent_loads++; > > to: > > n_adjacent_loads += group_size; > > But I can see the argument that counting 1 per LDN makes more sense > for generic code, so personally I prefer the first approach > (i.e. multiplying by aarch64_ld234_st234_vectors before applying > the per-vector costs). > > This is one of the cases where the "new" tree-vect*.cc approach to > costing avoids a lot of the complexity associated with the old approach, > in the sense that we can now perform the same calculation in a much > simpler way. But some of the code associated with the old approach > has been kept, even though it no longer applies. And that's messing > up the calculation.
Btw, I agree with removing the block. What we have now makes no sense. Richard. > > Thanks, > Richard
