[Bug c/125571] compiling with gcc 380 times slower than clang (19 minutes vs 3 seconds)

rguenth at gcc dot gnu.org via Gcc-bugs Wed, 03 Jun 2026 02:37:16 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125571


--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 541K of event 'cycles:Pu', Event count (approx.): 753304421084         
Overhead       Samples  Command  Shared Object         Symbol                   
  87.32%        471980  cc1      cc1                   [.]
bitmap_set_bit(bitmap_head*, int)          ◆
   3.98%         21474  cc1      cc1                   [.] void
mergesort<sort_ctx>(char*, sort_ctx*, ▒
   2.56%         13817  cc1      cc1                   [.]
compute_idf(bitmap_head*, bitmap_head*)    ▒
   1.16%          6279  cc1      libc.so.6             [.]
__memmove_avx512_unaligned_erms            ▒
   1.13%          6087  cc1      cc1                   [.] cmp_dfsnum(void
const*, void const*)       

and that bitmap_set_bit is from compute_idf.

The testcase works so that compute_idf starts out with a work set of length 1
and requires 10002 iterations to converge, almost all iterations pushing
exactly a single item to the work set, with the DFS having two bits usually.
It also happens that the two bits are 3 and N, with N increasing.  So this
is for the merge PHIs after the if() blocks.  There's also one DFS with
a full set of bits, but we only process this last.

The 3 <-> N skips can be improved by avoiding to cache head->first in
head->current, but that's of course a hack.  Using a tree view for the
iteration
is not as effective, the access pattern is more of a worst case for a
splay tree it seems (but it's better than nothing, original 57s, with
3 <-> N hack 44s and with tree view 52s).

Now, an issue seems to be that when we process the iftmp.N vars which
have defs/uses just in a diamond

   if (...)
     iftmp.N = 49;
   else
     iftmp.N = 48;
   bar (iftmp.N);

we start with the insertion point at 'bar', but then the DFS of that block
includes block 9 (the successor) and block 3(!?) (the abnormal dispatcher)
from which we then end up with PHI insertion places everywhere.  We
then prune this by liveness info in insert_phi_nodes_for, but that's
after-the-fact.  I'm not sure we can do any of the pruning during IDF
computation to limit iteration though.

[Bug c/125571] compiling with gcc 380 times slower than clang (19 minutes vs 3 seconds)

Reply via email to