https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125571
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 541K of event 'cycles:Pu', Event count (approx.): 753304421084
Overhead Samples Command Shared Object Symbol
87.32% 471980 cc1 cc1 [.]
bitmap_set_bit(bitmap_head*, int) ◆
3.98% 21474 cc1 cc1 [.] void
mergesort<sort_ctx>(char*, sort_ctx*, ▒
2.56% 13817 cc1 cc1 [.]
compute_idf(bitmap_head*, bitmap_head*) ▒
1.16% 6279 cc1 libc.so.6 [.]
__memmove_avx512_unaligned_erms ▒
1.13% 6087 cc1 cc1 [.] cmp_dfsnum(void
const*, void const*)
and that bitmap_set_bit is from compute_idf.
The testcase works so that compute_idf starts out with a work set of length 1
and requires 10002 iterations to converge, almost all iterations pushing
exactly a single item to the work set, with the DFS having two bits usually.
It also happens that the two bits are 3 and N, with N increasing. So this
is for the merge PHIs after the if() blocks. There's also one DFS with
a full set of bits, but we only process this last.
The 3 <-> N skips can be improved by avoiding to cache head->first in
head->current, but that's of course a hack. Using a tree view for the
iteration
is not as effective, the access pattern is more of a worst case for a
splay tree it seems (but it's better than nothing, original 57s, with
3 <-> N hack 44s and with tree view 52s).
Now, an issue seems to be that when we process the iftmp.N vars which
have defs/uses just in a diamond
if (...)
iftmp.N = 49;
else
iftmp.N = 48;
bar (iftmp.N);
we start with the insertion point at 'bar', but then the DFS of that block
includes block 9 (the successor) and block 3(!?) (the abnormal dispatcher)
from which we then end up with PHI insertion places everywhere. We
then prune this by liveness info in insert_phi_nodes_for, but that's
after-the-fact. I'm not sure we can do any of the pruning during IDF
computation to limit iteration though.