https://gcc.gnu.org/g:b1859c2a3d6bcda3a54fe93385a78bfe6dbba945

commit r16-7942-gb1859c2a3d6bcda3a54fe93385a78bfe6dbba945
Author: Jørgen Kvalsvik <[email protected]>
Date:   Sat Mar 7 10:36:40 2026 +0100

    Improve speed of masking table algorithm for MC/DC
    
    The masking table was computed by considering the cartesian product of
    incoming edges, ordering the pairs, and doing upwards BFS searches
    from the successors of the lower topologically indexed ones (higher in
    the graph).  The problem with this approach is that all the nodes
    found from the higher candidates would also be found from the lower
    candidates, and since we want to collect the set intersection, any
    higher candidate is dominated by a lower candidate.
    
    We need only consider adjacent elements in the sorted set of
    candidates.  This has a dramatic performance impact for large
    functions.  The worst case is expressions of the form (x && y && ...)
    and (x || y || ...) with up to 64 terms.  I did a wallclock comparison
    of the full analysis phase (including emitting the GIMPLE):
    
    test.c:
        int fn1 (int a[])
        {
          (a[0] && a[1] && ...) // 64 times
          (a[0] && a[1] && ...) // 64 times
          ...                   // 500 times
        }
    
        int main ()
        {
          int a[64];
          for (int i = 0; i != 10000; ++i)
            {
              for (int k = 0; k != 64; ++k)
                a[k] = i % k;
              fn1 (a);
            }
        }
    
    Without this patch:
        fn1 instrumented in 20822.303 ms (41.645 ms per expression)
    
    With this patch:
        fn1 instrumented in 1288.548  ms (2.577  ms per expression)
    
    I also tried considering terms left-to-right so that, whenever the
    search found an already-processed term, it would stop and just insert
    that term's complete table entry, but this had no measurable impact
    on compile time, and the result was a slightly more complicated
    function.
    
    This inefficiency went unnoticed for a while, because these
    expressions aren't very common.  The most I've seen in the wild is 27
    conditions, and that involved a lot of nested expressions which aren't
    impacted as much.
    
    gcc/ChangeLog:
    
            * tree-profile.cc (struct conds_ctx): Add edges.
            (topological_src_cmp): New function.
            (masking_vectors): New search strategy.

Diff:
---
 gcc/tree-profile.cc | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
index f2f160ae742d..0499a31027c1 100644
--- a/gcc/tree-profile.cc
+++ b/gcc/tree-profile.cc
@@ -182,6 +182,7 @@ struct conds_ctx
     auto_sbitmap G1;
     auto_sbitmap G2;
     auto_sbitmap G3;
+    auto_vec<edge, 64> edges;
 
     explicit conds_ctx (unsigned size) noexcept (true) : G1 (size), G2 (size),
     G3 (size)
@@ -217,6 +218,16 @@ topological_cmp (const void *lhs, const void *rhs, void *top_index)
     return (*im)[l->index] - (*im)[r->index];
 }
 
+/* topological_cmp of the src block of LHS and RHS.  The TOP_INDEX argument
+   should be the top_index vector from ctx.  */
+int
+topological_src_cmp (const void *lhs, const void *rhs, void *top_index)
+{
+    const_edge l = *(const edge*) lhs;
+    const_edge r = *(const edge*) rhs;
+    return topological_cmp (&l->src, &r->src, top_index);
+}
+
 /* Find the index of NEEDLE in BLOCKS; return -1 if not found.  This has two
    uses, sometimes for the index and sometimes for set member checks.  Sets are
    typically very small (number of conditions, >8 is uncommon) so linear search
@@ -437,7 +448,17 @@ condition_uid (struct function *fn, basic_block b)
 
    We find the terms by marking the outcomes (in this case c, T) and walk the
    predecessors starting at top (in this case b) and masking nodes when both
-   successors are marked.
+   successors are marked.  This is equivalent to removing the two outcome nodes
+   of the subexpression and finding the nodes not in the inverse reachability
+   set.
+
+   We only have to consider the pairs of top, bot where top is the closest
+   (highest-index'd) candidate that still satisfies top < bot in the
+   topological order, as this will be the immediate left operand.  The nodes of
+   the other left operands will also be found when going through the rightmost
+   term, and a lower-index'd top would just find subsets.  This has a
+   significant performance impact, 15-20x faster for the worst cases of (x && y
+   && ..) with no nesting.
 
    The masking table is represented as two bitfields per term in the expression
    with the index corresponding to the term in the Boolean expression.
@@ -514,16 +535,19 @@ masking_vectors (conds_ctx& ctx, array_slice<basic_block> blocks,
     for (size_t i = 1; i != body.length (); i++)
     {
        const basic_block b = body[i];
-       for (edge e1 : b->preds)
-       for (edge e2 : b->preds)
-       {
-           if (e1 == e2)
-               continue;
-           if ((e1->flags | e2->flags) & EDGE_COMPLEX)
-               continue;
+       if (b->preds->length () < 2)
+         continue;
+       ctx.edges.truncate (0);
+       ctx.edges.reserve (b->preds->length ());
+       for (edge e : b->preds)
+         if (!(e->flags & EDGE_COMPLEX))
+           ctx.edges.quick_push (contract_edge_up (e));
+       ctx.edges.sort (topological_src_cmp, &ctx.top_index);
 
-           edge etop = contract_edge_up (e1);
-           edge ebot = contract_edge_up (e2);
+       for (size_t i0 = 0, i1 = 1; i1 != ctx.edges.length (); ++i0, ++i1)
+       {
+           edge etop = ctx.edges[i0];
+           edge ebot = ctx.edges[i1];
            gcc_assert (etop != ebot);
 
            const basic_block top = etop->src;
