https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122218
Bug ID: 122218
Summary: Missed loop header duplication in walk_tree_1 with
profiledbootstrap
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
optimized dump of walk_tree_1 (profiledbootstrap with LTO and checking enabled)
starts with:
;; basic block 2, loop depth 0, count 108482707983 (adjusted, freq 1.0000),
maybe hot
;; prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;; pred: ENTRY [always] count:108482707983 (adjusted, freq 1.0000)
(FALLTHRU,EXECUTABLE)
goto <bb 4>; [100.00%]
;; succ: 4 [always] count:108482707983 (adjusted, freq 1.0000)
(FALLTHRU,EXECUTABLE)
;; basic block 3, loop depth 1, count 72676903 (adjusted, freq 0.0007), maybe
hot
;; prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;; pred: 17 [never] count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11693:2 discrim 2
;; 15 [always] count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11691:2 discrim 1
;; 79 [always] count:35711872 (adjusted, freq 0.0003)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11738:2 discrim 1
;; 88 [always] count:645603 (adjusted, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11750:2 discrim 1
;; 94 [never] count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11755:7 discrim 1
;; 103 [always] count:130807 (adjusted, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11769:7 discrim 1
;; 135 [always] count:8904466 (adjusted, freq 0.0001)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11785:2 discrim 2
;; 164 [never] count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11804:2 discrim 2
;; 202 [always] count:7497173 (adjusted, freq 0.0001)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11816:2 discrim 1
;; 234 [always] count:15992 (adjusted, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11896:4 discrim 1
;; 56 [always] count:19770990 (adjusted, freq 0.0002)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11724:7 discrim 1
# tp_227 = PHI <tp_220(17), tp_222(15), _302(79), tp_196(88), tp_203(94),
tp_171(103), _355(135), tp_130(164), _477(202), tp_170(234), tp_211(56)>
;; succ: 4 [always] count:72676903 (adjusted, freq 0.0007)
(FALLTHRU,DFS_BACK,EXECUTABLE)
;; basic block 4, loop depth 1, count 108555384886 (adjusted, freq 1.0007),
maybe hot
;; prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;; pred: 3 [always] count:72676903 (adjusted, freq 0.0007)
(FALLTHRU,DFS_BACK,EXECUTABLE)
;; 2 [always] count:108482707983 (adjusted, freq 1.0000)
(FALLTHRU,EXECUTABLE)
# tp_232 = PHI <tp_227(3), tp_111(D)(2)>
# DEBUG tp => tp_232
# DEBUG BEGIN_STMT
# DEBUG BEGIN_STMT
_689 = *tp_232;
if (_689 == 0B)
goto <bb 5>; [20.71%]
else
goto <bb 7>; [79.29%]
Where BB 5 is leading directly to return.
So the test in BB 4 has (for *tp != NULL) has 20% chance of returning. It is
with frequency 1 reached from entry and with frequency 0.0007 via one of the
loopback edges which are hand optimized turning recursion to loop.
Now loop ch disambiguate loops:
fix_loop_structure: fixing up loops for function
Disambiguating loop 1 with multiple latches
Merged latch edges of loop 1
Created preheader block for loop 7
Created preheader block for loop 9
Created preheader block for loop 8
Created preheader block for loop 10
and as a result it gives up:
Estimating # of iterations of loop 1
Not duplicating bb 214: it is single succ.
here BB 214 is one of the preheaders added.
Consequently codegen is bad:
1.33 │ sub $0x58,%rsp ◆
0.48 │ mov %rbx,0x30(%rsp) ▒
0.64 │ mov %rcx,%rbx ▒
0.53 │ mov %rbp,0x38(%rsp) ▒
0.43 │ mov %rsi,%rbp ▒
0.59 │ mov %r12,0x40(%rsp) ▒
0.48 │ mov %rdx,%r12 ▒
0.90 │ mov %r13,0x48(%rsp) ▒
0.21 │ mov %rdi,%r13 ▒
0.53 │ 24: mov 0x0(%r13),%rdx ▒
3.09 │ test %rdx,%rdx ▒
0.48 │ ↓ je 89 ▒
....
0.59 │ 89:└─→xor %eax,%eax ▒
0.53 │ 8b: mov 0x30(%rsp),%rbx ▒
0.85 │ mov 0x38(%rsp),%rbp ▒
1.17 │ mov 0x40(%rsp),%r12 ▒
0.21 │ mov 0x48(%rsp),%r13 ▒
0.27 │ add $0x58,%rsp ▒
1.07 │ ← ret ▒
I got lost in shrink wrap, but I guess it is not able to fixup the loop
structure. Duplicating loop header here would make a lot of sense, but we would
need to understand that the loop header is shared by multiple loops (perhaps we
want to not disambiguate loops for CH pass, since it is simple enough so it can
deal with multiple latches?)
walk_tree_1 is a hot function in GCC profile.
Without profile loops disambiguate differently and header is copied.
23 │ sub $0x58,%rsp ◆
29 │ mov %r12,0x38(%rsp) ▒
2 │ mov %rdx,%r12 ▒
11 │ mov (%rdi),%rdx ▒
142 │ ┌──test %rdx,%rdx ▒
5 │ ├──je 72 ▒
│ │ mov %r14,0x48(%rsp) ▒
18 │ │ mov %r15,0x50(%rsp) ▒
12 │ │ mov %r8,(%rsp) ▒
│ │ mov %rbx,0x28(%rsp) ▒
3 │ │ mov %rcx,%rbx ▒
8 │ │ mov %rbp,0x30(%rsp) ▒
11 │ │ mov %rsi,%rbp ▒
29 │ │ mov %r13,0x40(%rsp) ▒
8 │ │ mov %rdi,%r13 ▒
5 │ 3a:│ test %rbx,%rbx ▒
12 │ │↓ je 91 ▒
│ │ sar $0x3,%rdx ▒
1 │ │ mov $0x1,%ecx ▒
2 │ │ mov %r13,%rsi ▒
│ │ mov %rbx,%rdi ▒
1 │ │→ call hash_table<default_hash_traits<▒
1 │ │ cmpq $0x0,(%rax) ▒
│ │↓ je 80 ▒
1 │ 59:│ mov 0x28(%rsp),%rbx ▒
35 │ │ mov 0x30(%rsp),%rbp ▒
6 │ │ mov 0x40(%rsp),%r13 ▒
5 │ │ mov 0x48(%rsp),%r14 ▒
11 │ │ mov 0x50(%rsp),%r15 ▒
24 │ 72:└─→mov 0x38(%rsp),%r12 ▒
11 │ xor %eax,%eax ▒
2 │ add $0x58,%rsp ▒
│ ← ret ▒