https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122218

            Bug ID: 122218
           Summary: Missed loop header duplication in walk_tree_1 with
                    profiledbootstrap
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

optimized dump of walk_tree_1 (profiledbootstrap with LTO and checking enabled)
starts with:


;;   basic block 2, loop depth 0, count 108482707983 (adjusted, freq 1.0000),
maybe hot
;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;;    pred:       ENTRY [always]  count:108482707983 (adjusted, freq 1.0000)
(FALLTHRU,EXECUTABLE)
  goto <bb 4>; [100.00%]
;;    succ:       4 [always]  count:108482707983 (adjusted, freq 1.0000)
(FALLTHRU,EXECUTABLE)

;;   basic block 3, loop depth 1, count 72676903 (adjusted, freq 0.0007), maybe
hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;    pred:       17 [never]  count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11693:2 discrim 2
;;                15 [always]  count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11691:2 discrim 1
;;                79 [always]  count:35711872 (adjusted, freq 0.0003)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11738:2 discrim 1
;;                88 [always]  count:645603 (adjusted, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11750:2 discrim 1
;;                94 [never]  count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11755:7 discrim 1
;;                103 [always]  count:130807 (adjusted, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11769:7 discrim 1
;;                135 [always]  count:8904466 (adjusted, freq 0.0001)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11785:2 discrim 2
;;                164 [never]  count:0 (precise, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11804:2 discrim 2
;;                202 [always]  count:7497173 (adjusted, freq 0.0001)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11816:2 discrim 1
;;                234 [always]  count:15992 (adjusted, freq 0.0000)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11896:4 discrim 1
;;                56 [always]  count:19770990 (adjusted, freq 0.0002)
(FALLTHRU,EXECUTABLE) ../../gcc/tree.cc:11724:7 discrim 1
  # tp_227 = PHI <tp_220(17), tp_222(15), _302(79), tp_196(88), tp_203(94),
tp_171(103), _355(135), tp_130(164), _477(202), tp_170(234), tp_211(56)>
;;    succ:       4 [always]  count:72676903 (adjusted, freq 0.0007)
(FALLTHRU,DFS_BACK,EXECUTABLE)

;;   basic block 4, loop depth 1, count 108555384886 (adjusted, freq 1.0007),
maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;    pred:       3 [always]  count:72676903 (adjusted, freq 0.0007)
(FALLTHRU,DFS_BACK,EXECUTABLE)
;;                2 [always]  count:108482707983 (adjusted, freq 1.0000)
(FALLTHRU,EXECUTABLE)
  # tp_232 = PHI <tp_227(3), tp_111(D)(2)>
  # DEBUG tp => tp_232
  # DEBUG BEGIN_STMT
  # DEBUG BEGIN_STMT
  _689 = *tp_232;
  if (_689 == 0B)
    goto <bb 5>; [20.71%]
  else
    goto <bb 7>; [79.29%]

Where BB 5 is leading directly to return.
So the test in BB 4 has (for *tp != NULL) has 20% chance of returning. It is
with frequency 1 reached from entry and with frequency 0.0007 via one of the
loopback edges which are hand optimized turning recursion to loop.

Now loop ch disambiguate loops:
fix_loop_structure: fixing up loops for function
Disambiguating loop 1 with multiple latches
Merged latch edges of loop 1
Created preheader block for loop 7
Created preheader block for loop 9
Created preheader block for loop 8
Created preheader block for loop 10


and as a result it gives up:

Estimating # of iterations of loop 1
  Not duplicating bb 214: it is single succ.

here BB 214 is one of the preheaders added.

Consequently codegen is bad:

   1.33 │       sub    $0x58,%rsp                     ◆
   0.48 │       mov    %rbx,0x30(%rsp)                ▒
   0.64 │       mov    %rcx,%rbx                      ▒
   0.53 │       mov    %rbp,0x38(%rsp)                ▒
   0.43 │       mov    %rsi,%rbp                      ▒
   0.59 │       mov    %r12,0x40(%rsp)                ▒
   0.48 │       mov    %rdx,%r12                      ▒
   0.90 │       mov    %r13,0x48(%rsp)                ▒
   0.21 │       mov    %rdi,%r13                      ▒
   0.53 │ 24:   mov    0x0(%r13),%rdx                 ▒
   3.09 │       test   %rdx,%rdx                      ▒
   0.48 │     ↓ je     89                             ▒

....

   0.59 │ 89:└─→xor    %eax,%eax                      ▒
   0.53 │ 8b:   mov    0x30(%rsp),%rbx                ▒
   0.85 │       mov    0x38(%rsp),%rbp                ▒
   1.17 │       mov    0x40(%rsp),%r12                ▒
   0.21 │       mov    0x48(%rsp),%r13                ▒
   0.27 │       add    $0x58,%rsp                     ▒
   1.07 │     ← ret                                   ▒

I got lost in shrink wrap, but I guess it is not able to fixup the loop
structure. Duplicating loop header here would make a lot of sense, but we would
need to understand that the loop header is shared by multiple loops (perhaps we
want to not disambiguate loops for CH pass, since it is simple enough so it can
deal with multiple latches?)

walk_tree_1 is a hot function in GCC profile.

Without profile loops disambiguate differently and header is copied.

     23 │       sub    $0x58,%rsp                     ◆
     29 │       mov    %r12,0x38(%rsp)                ▒
      2 │       mov    %rdx,%r12                      ▒
     11 │       mov    (%rdi),%rdx                    ▒
    142 │    ┌──test   %rdx,%rdx                      ▒
      5 │    ├──je     72                             ▒
        │    │  mov    %r14,0x48(%rsp)                ▒
     18 │    │  mov    %r15,0x50(%rsp)                ▒
     12 │    │  mov    %r8,(%rsp)                     ▒
        │    │  mov    %rbx,0x28(%rsp)                ▒
      3 │    │  mov    %rcx,%rbx                      ▒
      8 │    │  mov    %rbp,0x30(%rsp)                ▒
     11 │    │  mov    %rsi,%rbp                      ▒
     29 │    │  mov    %r13,0x40(%rsp)                ▒
      8 │    │  mov    %rdi,%r13                      ▒
      5 │ 3a:│  test   %rbx,%rbx                      ▒
     12 │    │↓ je     91                             ▒
        │    │  sar    $0x3,%rdx                      ▒
      1 │    │  mov    $0x1,%ecx                      ▒
      2 │    │  mov    %r13,%rsi                      ▒
        │    │  mov    %rbx,%rdi                      ▒
      1 │    │→ call   hash_table<default_hash_traits<▒
      1 │    │  cmpq   $0x0,(%rax)                    ▒
        │    │↓ je     80                             ▒
      1 │ 59:│  mov    0x28(%rsp),%rbx                ▒
     35 │    │  mov    0x30(%rsp),%rbp                ▒
      6 │    │  mov    0x40(%rsp),%r13                ▒
      5 │    │  mov    0x48(%rsp),%r14                ▒
     11 │    │  mov    0x50(%rsp),%r15                ▒
     24 │ 72:└─→mov    0x38(%rsp),%r12                ▒
     11 │       xor    %eax,%eax                      ▒
      2 │       add    $0x58,%rsp                     ▒
        │     ← ret                                   ▒

Reply via email to