> Am 07.03.2026 um 12:09 schrieb Jørgen Kvalsvik <[email protected]>:
>
> The masking table was computed by considering the cartesian product of
> incoming edges, ordering the pairs, and doing upwards BFS searches
> from the sucessors of the lower topologically index'd ones (higher in
> the graph). The problem with this approach is that all the nodes we
> find from the higher candidates would also be found from the lower
> candidates, and since we want to collect the set intersection, any
> higher candidate would be dominated by lower candidates.
>
> We need only consider adjacent elements in the sorted set of
> candidates. This has a dramatic performance impact for large
> functions. The worst case is expressions on the form (x && y && ...)
> and (x || y || ...) with up-to 64 elements. I did a wallclock
> comparison of the full analysis phase (including emitting the GIMPLE):
>
> test.c:
> int fn (int a[])
> {
> (a[0] && a[1] && ...) // 64 times
> (a[0] && a[1] && ...) // 64 times
> ... // 500 times
> }
>
> int main ()
> {
> int a[64];
> for (int i = 0; i != 10000; ++i)
> {
> for (int k = 0; k != 64; ++k)
> a[k] = i % k;
> fn1 (a);
> }
> }
>
> Without this patch:
> fn1 instrumented in 20822.303 ms (41.645 ms per expression)
>
> With this patch:
> fn1 instrumented in 1288.548 ms (2.577 ms per expression)
Impressive
> I also tried considering terms left-to-right and, whenever the search
> found an already-processed expression it would stop the search and
> just insert its complete table entry, but this had no measurable
> impact on compile time, and the result was a slightly more complicated
> function.
>
> This inefficiency went unnoticed for a while, because these
> expressions aren't very common. The most I've seen in the wild is 27
> conditions, and that involved a lot of nested expressions which aren't
> impacted as much.
OK, assuming it bootstraps and tests OK
Richard
> gcc/ChangeLog:
>
> * tree-profile.cc (struct conds_ctx): Add edges.
> (topological_src_cmp): New function.
> (masking_vectors): New search strategy.
> ---
> gcc/tree-profile.cc | 44 ++++++++++++++++++++++++++++++++++----------
> 1 file changed, 34 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
> index fed218eb60b..7f803814a33 100644
> --- a/gcc/tree-profile.cc
> +++ b/gcc/tree-profile.cc
> @@ -182,6 +182,7 @@ struct conds_ctx
> auto_sbitmap G1;
> auto_sbitmap G2;
> auto_sbitmap G3;
> + auto_vec<edge, 64> edges;
>
> explicit conds_ctx (unsigned size) noexcept (true) : G1 (size), G2 (size),
> G3 (size)
> @@ -217,6 +218,16 @@ topological_cmp (const void *lhs, const void *rhs, void
> *top_index)
> return (*im)[l->index] - (*im)[r->index];
> }
>
> +/* topological_cmp of the src block of LHS and RHS. The TOP_INDEX argument
> + should be the top_index vector from ctx. */
> +int
> +topological_src_cmp (const void *lhs, const void *rhs, void *top_index)
> +{
> + const_edge l = *(const edge*) lhs;
> + const_edge r = *(const edge*) rhs;
> + return topological_cmp (&l->src, &r->src, top_index);
> +}
> +
> /* Find the index of NEEDLE in BLOCKS; return -1 if not found. This has two
> uses, sometimes for the index and sometimes for set member checks. Sets
> are
> typically very small (number of conditions, >8 is uncommon) so linear
> search
> @@ -437,7 +448,17 @@ condition_uid (struct function *fn, basic_block b)
>
> We find the terms by marking the outcomes (in this case c, T) and walk the
> predecessors starting at top (in this case b) and masking nodes when both
> - successors are marked.
> + successors are marked. This is equivalent to removing the two outcome
> nodes
> + of the subexpression and finding the nodes not in the inverse reachability
> + set.
> +
> + We only have to consider the pairs of top, bot where top is the the
> closest
> + (highest-index'd) candidate that still satisfies top < bot in the
> + topological order, as this will be the immediate left operand. The nodes
> of
> + the other left operands will also be found when going through the righmost
> + term, and a lower-index'd top would just find subsets. This has a
> + significant performance impact, 15-20x faster for the worst cases of (x
> && y
> + && ..) with no nesting.
>
> The masking table is represented as two bitfields per term in the
> expression
> with the index corresponding to the term in the Boolean expression.
> @@ -514,16 +535,19 @@ masking_vectors (conds_ctx& ctx,
> array_slice<basic_block> blocks,
> for (size_t i = 1; i != body.length (); i++)
> {
> const basic_block b = body[i];
> - for (edge e1 : b->preds)
> - for (edge e2 : b->preds)
> - {
> - if (e1 == e2)
> - continue;
> - if ((e1->flags | e2->flags) & EDGE_COMPLEX)
> - continue;
> + if (b->preds->length () < 2)
> + continue;
> + ctx.edges.truncate (0);
> + ctx.edges.reserve (b->preds->length ());
> + for (edge e : b->preds)
> + if (!(e->flags & EDGE_COMPLEX))
> + ctx.edges.quick_push (contract_edge_up (e));
> + ctx.edges.sort (topological_src_cmp, &ctx.top_index);
>
> - edge etop = contract_edge_up (e1);
> - edge ebot = contract_edge_up (e2);
> + for (size_t i0 = 0, i1 = 1; i1 != ctx.edges.length (); ++i0, ++i1)
> + {
> + edge etop = ctx.edges[i0];
> + edge ebot = ctx.edges[i1];
> gcc_assert (etop != ebot);
>
> const basic_block top = etop->src;
> --
> 2.47.3
>