https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
Alexander - the testcase at -O1 shows curiously high

   3.16%          9840  cc1plus  cc1plus  [.] mergesort<sort_ctx>

which is attributed (by callgrind) to

      if (sizeof (size_t) == 8 && LIKELY (c->size == 8))
-->     MERGE_ELTSIZE (8);

and the caller in tree-into-ssa.cc:prune_unused_phi_nodes doing

  qsort (defs, adef, sizeof (struct dom_dfsnum), cmp_dfsnum);

I'm not sure why callgrind pins it this way, but perf somewhat agrees:

Samples│    │MERGE_ELTSIZE (8);
      1 │2d0:│  mov    %r9,%rsi
      8 │    │  mov    %r9,0x8(%rsp)
    528 │    │  mov    %r12,%rdi
     31 │    │→ call   *0x0(%r13)
    236 │    │  mov    0x8(%rsp),%r9
      2 │    │  sar    $0x1f,%eax
    244 │    │  mov    %r12,%rcx
        │    │  movslq %eax,%rdx
    531 │    │  and    $0x8,%eax
     62 │    │  add    $0x8,%rbx
        │    │  cltq
    725 │    │  xor    %r9,%rcx
    914 │    │  add    %rax,%r12
      1 │    │  and    %rdx,%rcx
        │    │  xor    %r9,%rcx
      3 │    │  mov    (%rcx),%rcx
   2155 │    │  mov    %rcx,-0x8(%rbx)
     29 │    │  cmp    %r12,%rbx
        │    └──je     1d7

I'll note the swapping of 8 bytes is a bit odd and it seems to be
if-converted, thus always doing a write.  I'm of course questioning what
prune_unused_phi_nodes does but I have no idea if that's sensible at all,
but it seems slow for this testcase, and the sorting is the slowest part of it.
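For readers trying to map the annotated instructions back to source, the sar/and/xor sequence looks like a branchless select between two source pointers followed by an unconditional 8-byte store. The following is only a minimal sketch of that pattern in plain C, not the code in GCC's sort.cc: the names merge8_branchless and cmp_u64 are hypothetical, and the mask is derived from a comparison result rather than from a sign-bit shift as in the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical comparator, qsort-style: <0, 0 or >0.  */
static int
cmp_u64 (const void *a, const void *b)
{
  uint64_t x, y;
  memcpy (&x, a, sizeof x);
  memcpy (&y, b, sizeof y);
  return (x > y) - (x < y);
}

/* Illustrative only: merge sorted runs l[0..nl) and r[0..nr) of 8-byte
   elements into out, selecting the source pointer with a mask instead
   of a branch.  Every iteration performs exactly one store.  */
static void
merge8_branchless (const uint64_t *l, size_t nl,
                   const uint64_t *r, size_t nr,
                   uint64_t *out,
                   int (*cmp) (const void *, const void *))
{
  assert (l && r && out && cmp);
  while (nl > 0 && nr > 0)
    {
      /* All ones if *r sorts strictly before *l, otherwise zero.  */
      intptr_t take_r = -(intptr_t) (cmp (r, l) < 0);
      /* src = take_r ? r : l, computed as l ^ ((l ^ r) & mask).
         Assumes pointers round-trip through uintptr_t, as on the
         usual flat-address targets.  */
      uintptr_t lr = (uintptr_t) l ^ (uintptr_t) r;
      uintptr_t src = (uintptr_t) l ^ (lr & (uintptr_t) take_r);
      memcpy (out, (const void *) src, sizeof *out);
      out++;
      /* Advance whichever side was consumed, again without branching.  */
      r += take_r & 1;
      nr -= take_r & 1;
      l += ~take_r & 1;
      nl -= ~take_r & 1;
    }
  /* Copy whatever remains; at most one of the two runs is non-empty.  */
  memcpy (out, l, nl * sizeof *out);
  memcpy (out + nl, r, nr * sizeof *out);
}
```

In this form the store to out does not depend on the comparison outcome; only its source address does, so a write happens on every iteration regardless of which side wins.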