https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
Alexander - the testcase at -O1 shows curiously high

   3.16%          9840  cc1plus  cc1plus             [.] mergesort<sort_ctx>

which is attributed (by callgrind) to

      if (sizeof (size_t) == 8 && LIKELY (c->size == 8))
-->     MERGE_ELTSIZE (8);

and the caller in tree-into-ssa.cc:prune_unused_phi_nodes doing

  qsort (defs, adef, sizeof (struct dom_dfsnum), cmp_dfsnum);

I'm not sure why callgrind pins it this way, but perf somewhat agrees:

Samples│    │MERGE_ELTSIZE (8);
     1 │2d0:│  mov    %r9,%rsi
     8 │    │  mov    %r9,0x8(%rsp)
   528 │    │  mov    %r12,%rdi
    31 │    │→ call   *0x0(%r13)
   236 │    │  mov    0x8(%rsp),%r9
     2 │    │  sar    $0x1f,%eax
   244 │    │  mov    %r12,%rcx
       │    │  movslq %eax,%rdx
   531 │    │  and    $0x8,%eax
    62 │    │  add    $0x8,%rbx
       │    │  cltq
   725 │    │  xor    %r9,%rcx
   914 │    │  add    %rax,%r12
     1 │    │  and    %rdx,%rcx
       │    │  xor    %r9,%rcx
     3 │    │  mov    (%rcx),%rcx
  2155 │    │  mov    %rcx,-0x8(%rbx)
    29 │    │  cmp    %r12,%rbx
       │    └──je     1d7

I'll note the swapping of 8 bytes is a bit odd and it seems to be
if-converted, thus always doing a write.
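
For readers who want the shape of that loop in source form, here is a
minimal sketch of a branchless merge step for 8-byte elements. It is
reconstructed from the disassembly above under my own assumptions and is
not the code in GCC's sort implementation; the names merge8_branchless and
cmp_u64 are made up for illustration. The comparator result becomes an
all-ones or all-zeros mask (the sar $0x1f above), the source pointer is
selected with xor/and/xor, and the 8-byte store to the output happens on
every iteration regardless of which side won.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int (*cmp_fn)(const void *, const void *);

/* Hypothetical comparator for 8-byte unsigned keys. */
int cmp_u64(const void *a, const void *b)
{
    uint64_t x, y;
    memcpy(&x, a, sizeof x);
    memcpy(&y, b, sizeof y);
    return (x > y) - (x < y);
}

/* Merge the sorted runs [l, l_end) and [r, r_end) of 8-byte elements
   into out.  The loop body has no data-dependent branch.  */
void merge8_branchless(const char *l, const char *l_end,
                       const char *r, const char *r_end,
                       char *out, cmp_fn cmp)
{
    assert((l_end - l) % 8 == 0 && (r_end - r) % 8 == 0);

    while (l != l_end && r != r_end) {
        /* All ones if *r sorts strictly before *l, else zero.  */
        uintptr_t mask = -(uintptr_t)(cmp(r, l) < 0);

        /* Select r when mask is set, l otherwise: l ^ ((l ^ r) & mask).  */
        uintptr_t lr = (uintptr_t)l ^ (uintptr_t)r;
        const char *src = (const char *)((uintptr_t)l ^ (lr & mask));

        memcpy(out, src, 8);    /* unconditional 8-byte store */
        out += 8;

        r += mask & 8;          /* advance whichever side was consumed */
        l += ~mask & 8;
    }

    /* Copy whatever remains of the run that was not exhausted.  */
    memcpy(out, l, (size_t)(l_end - l));
    out += l_end - l;
    memcpy(out, r, (size_t)(r_end - r));
}
```

The trade-off in a loop of this shape is that it avoids a hard-to-predict
branch on the comparison result, at the price of a load and a store whose
address depends on the indirect call's return value.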

I do of course question what prune_unused_phi_nodes is doing here. I have
no idea whether it is sensible at all, but it seems slow for this testcase,
and the sorting is the slowest part of it.
