https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125617

            Bug ID: 125617
           Summary: [x86_64] std::priority_queue push/pop ~2x slower than
                    clang at -O2: if-conversion puts a cmov chain on the
                    loop-carried critical path of __adjust_heap
           Product: gcc
           Version: 16.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: oscar.yang at cycraft dot com
  Target Milestone: ---

Created attachment 64634
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=64634&action=edit
the minimal reproduce code

On x86_64, libstdc++'s std::priority_queue is ~2x slower when built with gcc
than with clang at -O2, on a fixed-size push/pop (swap-in/swap-out) workload.
Both use the same libstdc++ headers, so this is a code-generation difference.

Root cause: gcc's RTL if-conversion compiles the "pick the larger child"
comparison in `__adjust_heap` (bits/stl_heap.h) into a 3-deep cmov chain: the
selected child value, index, and pointer are all chosen by the same compare,
and the index/pointer feed the next iteration's address and load. On the heap
descent this is a loop-carried dependency: each level must wait a full load
latency before the next load's address is known, so there is no memory-level
parallelism. clang keeps the child-selection as a conditional branch; on this
workload the branch is almost perfectly predicted, so the out-of-order core
speculates the next load and overlaps successive levels.

This is the pointer-chasing analogue of the reduction case in PR 82666: same
underlying "cmov on the loop-carried critical path" cost-model gap, but a
different shape. It is NOT a "cond ? x : 0" arithmetic select, so
`noce_try_cond_zero_arith` (PR 82666 comment #17) does not cover it. It is also
a real standard-library hot path, not only a synthetic loop.

Steps to reproduce:

  Save as pq.cpp:

```
    #include <queue>
    #include <vector>
    #include <cstdint>
    #include <random>
    #include <cstdio>
    __attribute__((noinline))
    uint64_t run(std::priority_queue<uint64_t>& q, const uint64_t* keys, int n,
long iters) {
        uint64_t sink = 0; size_t idx = n;
        for (long i = 0; i < iters; ++i) {
            q.push(keys[idx]); q.pop(); sink ^= q.top();   // swap-in /
swap-out
            if (++idx >= (size_t)(n + 4096)) idx = n;
        }
        return sink;
    }
    int main() {
        const int N = 4096; std::mt19937_64 rng(123);
        std::vector<uint64_t> keys(N + 4096); for (auto& k : keys) k = rng();
        std::priority_queue<uint64_t> q; for (int i = 0; i < N; ++i)
q.push(keys[i]);
        volatile uint64_t s = run(q, keys.data(), N, 80000000L);
        printf("%llu\n", (unsigned long long)s);
    }
```

  Build and measure:

    g++     -O2 -std=c++23 pq.cpp -o pq_gcc
    clang++ -O2 -std=c++23 pq.cpp -o pq_clang
    taskset -c 2 setarch -R perf stat -e
cycles:u,instructions:u,branches:u,branch-misses:u ./pq_gcc
    taskset -c 2 setarch -R perf stat -e
cycles:u,instructions:u,branches:u,branch-misses:u ./pq_clang

  Inspect the hot loop:  objdump -d pq_gcc | c++filt | sed -n '/<run/,/ret/p'

Environment: gcc 16.1.1 20260430, clang 22.1.6, x86_64, Tiger Lake (i5-1135G7),
-O2 -std=c++23.

Measured (N=4096, 8e7 push/pop):

    build                                            cmov in run()   instr:u  
cycles:u   IPC    wall
    g++ -O2                                           8              32.0e9   
14.5e9     2.20   3.51s
    g++ -O2 -fno-if-conversion -fno-if-conversion2    4              29.6e9    
7.5e9     3.94   1.94s
    clang++ -O2                                       1              24.6e9    
6.1e9     4.05   1.63s

Expected:
  Comparable performance to clang. The data-dependent child-selection should
stay a branch (or the loop should be restructured), since on
unpredictable-but-cheap data a well-predicted branch beats a cmov on the
loop-carried critical path.

Actual:
  gcc emits a cmov chain that serializes the heap descent. IPC is ~2.2 vs
clang's ~4.0, i.e. ~2x slower wall time, even though the working set (4096 * 8
B = 32 KB) fits in cache.

Why this is the if-conversion cost model, not misprediction or codegen size:
  - Branch-mispredict rate is < 0.1% in all three builds, so this is not a
misprediction problem.
  - -fno-if-conversion -fno-if-conversion2 (or --param
max-rtl-if-conversion-insns=0) turns the
    child-select back into a branch, roughly halves cycles and nearly doubles
IPC, while leaving the
    dynamic instruction count essentially unchanged (32.0e9 -> 29.6e9). The
cost is the cmov on the
    critical path, not extra work.
  - The gap persists at N=64 (fully L1-resident), so it is not memory-bound;
-O3 and -march=native
    do not help (they keep the cmov).

Question:
  PR 123017 added a predictability-aware if-conversion cost model for AArch64.
Would extending that model to the i386/x86-64 backend be the right direction
for shapes like this, where a well-predicted branch should win over a cmov on
the loop-carried critical path?

  Happy to provide full disassembly, perf counters for more N, or a
self-contained benchmark.

Related:
  - PR 56309 - the canonical x86 "conditional moves instead of compare and
branch result in almost 2x slower code".
  - PR 82666 - same cost-model gap in a reduction loop (sum += (x>128 ? x :
0)).
  - PR 123017 - predictability-aware if-conversion cost model, added for
AArch64.

Reply via email to