> Am 07.03.2026 um 12:09 schrieb Jørgen Kvalsvik <[email protected]>:
> 
> The masking table was computed by considering the cartesian product of
> incoming edges, ordering the pairs, and doing upwards BFS searches
> from the sucessors of the lower topologically index'd ones (higher in
> the graph). The problem with this approach is that all the nodes we
> find from the higher candidates would also be found from the lower
> candidates, and since we want to collect the set intersection, any
> higher candidate would be dominated by lower candidates.
> 
> We need only consider adjacent elements in the sorted set of
> candidates.  This has a dramatic performance impact for large
> functions.  The worst case is expressions on the form (x && y && ...)
> and (x || y || ...) with up-to 64 elements. I did a wallclock
> comparison of the full analysis phase (including emitting the GIMPLE):
> 
> test.c:
>    int fn (int a[])
>    {
>      (a[0] && a[1] && ...) // 64 times
>      (a[0] && a[1] && ...) // 64 times
>      ...                   // 500 times
>    }
> 
>    int main ()
>    {
>      int a[64];
>      for (int i = 0; i != 10000; ++i)
>        {
>          for (int k = 0; k != 64; ++k)
>            a[k] = i % k;
>          fn1 (a);
>        }
>    }
> 
> Without this patch:
>    fn1 instrumented in 20822.303 ms (41.645 ms per expression)
> 
> With this patch:
>    fn1 instrumented in 1288.548  ms (2.577  ms per expression)

Impressive 

> I also tried considering terms left-to-right and, whenever the search
> found an already-processed expression it would stop the search and
> just insert its complete table entry, but this had no measurable
> impact on compile time, and the result was a slightly more complicated
> function.
> 
> This inefficiency went unnoticed for a while, because these
> expressions aren't very common.  The most I've seen in the wild is 27
> conditions, and that involved a lot of nested expressions which aren't
> impacted as much.

OK, assuming it bootstraps and tests OK

Richard 

> gcc/ChangeLog:
> 
>    * tree-profile.cc (struct conds_ctx): Add edges.
>    (topological_src_cmp): New function.
>    (masking_vectors): New search strategy.
> ---
> gcc/tree-profile.cc | 44 ++++++++++++++++++++++++++++++++++----------
> 1 file changed, 34 insertions(+), 10 deletions(-)
> 
> diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
> index fed218eb60b..7f803814a33 100644
> --- a/gcc/tree-profile.cc
> +++ b/gcc/tree-profile.cc
> @@ -182,6 +182,7 @@ struct conds_ctx
>     auto_sbitmap G1;
>     auto_sbitmap G2;
>     auto_sbitmap G3;
> +    auto_vec<edge, 64> edges;
> 
>     explicit conds_ctx (unsigned size) noexcept (true) : G1 (size), G2 (size),
>     G3 (size)
> @@ -217,6 +218,16 @@ topological_cmp (const void *lhs, const void *rhs, void 
> *top_index)
>     return (*im)[l->index] - (*im)[r->index];
> }
> 
> +/* topological_cmp of the src block of LHS and RHS.  The TOP_INDEX argument
> +   should be the top_index vector from ctx.  */
> +int
> +topological_src_cmp (const void *lhs, const void *rhs, void *top_index)
> +{
> +    const_edge l = *(const edge*) lhs;
> +    const_edge r = *(const edge*) rhs;
> +    return topological_cmp (&l->src, &r->src, top_index);
> +}
> +
> /* Find the index of NEEDLE in BLOCKS; return -1 if not found.  This has two
>    uses, sometimes for the index and sometimes for set member checks.  Sets 
> are
>    typically very small (number of conditions, >8 is uncommon) so linear 
> search
> @@ -437,7 +448,17 @@ condition_uid (struct function *fn, basic_block b)
> 
>    We find the terms by marking the outcomes (in this case c, T) and walk the
>    predecessors starting at top (in this case b) and masking nodes when both
> -   successors are marked.
> +   successors are marked.  This is equivalent to removing the two outcome 
> nodes
> +   of the subexpression and finding the nodes not in the inverse reachability
> +   set.
> +
> +   We only have to consider the pairs of top, bot where top is the the 
> closest
> +   (highest-index'd) candidate that still satisfies top < bot in the
> +   topological order, as this will be the immediate left operand.  The nodes 
> of
> +   the other left operands will also be found when going through the righmost
> +   term, and a lower-index'd top would just find subsets.  This has a
> +   significant performance impact, 15-20x faster for the worst cases of (x 
> && y
> +   && ..) with no nesting.
> 
>    The masking table is represented as two bitfields per term in the 
> expression
>    with the index corresponding to the term in the Boolean expression.
> @@ -514,16 +535,19 @@ masking_vectors (conds_ctx& ctx, 
> array_slice<basic_block> blocks,
>     for (size_t i = 1; i != body.length (); i++)
>     {
>    const basic_block b = body[i];
> -    for (edge e1 : b->preds)
> -    for (edge e2 : b->preds)
> -    {
> -        if (e1 == e2)
> -        continue;
> -        if ((e1->flags | e2->flags) & EDGE_COMPLEX)
> -        continue;
> +    if (b->preds->length () < 2)
> +      continue;
> +    ctx.edges.truncate (0);
> +    ctx.edges.reserve (b->preds->length ());
> +    for (edge e : b->preds)
> +      if (!(e->flags & EDGE_COMPLEX))
> +        ctx.edges.quick_push (contract_edge_up (e));
> +    ctx.edges.sort (topological_src_cmp, &ctx.top_index);
> 
> -        edge etop = contract_edge_up (e1);
> -        edge ebot = contract_edge_up (e2);
> +    for (size_t i0 = 0, i1 = 1; i1 != ctx.edges.length (); ++i0, ++i1)
> +    {
> +        edge etop = ctx.edges[i0];
> +        edge ebot = ctx.edges[i1];
>        gcc_assert (etop != ebot);
> 
>        const basic_block top = etop->src;
> --
> 2.47.3
> 

Reply via email to