https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125617
Bug ID: 125617
Summary: [x86_64] std::priority_queue push/pop ~2x slower than
clang at -O2: if-conversion puts a cmov chain on the
loop-carried critical path of __adjust_heap
Product: gcc
Version: 16.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: oscar.yang at cycraft dot com
Target Milestone: ---
Created attachment 64634
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=64634&action=edit
the minimal reproduce code
On x86_64, libstdc++'s std::priority_queue is ~2x slower when built with gcc
than with clang at -O2, on a fixed-size push/pop (swap-in/swap-out) workload.
Both use the same libstdc++ headers, so this is a code-generation difference.
Root cause: gcc's RTL if-conversion compiles the "pick the larger child"
comparison in `__adjust_heap` (bits/stl_heap.h) into a 3-deep cmov chain: the
selected child value, index, and pointer are all chosen by the same compare,
and the index/pointer feed the next iteration's address and load. On the heap
descent this is a loop-carried dependency: each level must wait a full load
latency before the next load's address is known, so there is no memory-level
parallelism. clang keeps the child-selection as a conditional branch; on this
workload the branch is almost perfectly predicted, so the out-of-order core
speculates the next load and overlaps successive levels.
This is the pointer-chasing analogue of the reduction case in PR 82666: same
underlying "cmov on the loop-carried critical path" cost-model gap, but a
different shape. It is NOT a "cond ? x : 0" arithmetic select, so
`noce_try_cond_zero_arith` (PR 82666 comment #17) does not cover it. It is also
a real standard-library hot path, not only a synthetic loop.
Steps to reproduce:
Save as pq.cpp:
```
#include <queue>
#include <vector>
#include <cstdint>
#include <random>
#include <cstdio>
__attribute__((noinline))
uint64_t run(std::priority_queue<uint64_t>& q, const uint64_t* keys, int n,
long iters) {
uint64_t sink = 0; size_t idx = n;
for (long i = 0; i < iters; ++i) {
q.push(keys[idx]); q.pop(); sink ^= q.top(); // swap-in /
swap-out
if (++idx >= (size_t)(n + 4096)) idx = n;
}
return sink;
}
int main() {
const int N = 4096; std::mt19937_64 rng(123);
std::vector<uint64_t> keys(N + 4096); for (auto& k : keys) k = rng();
std::priority_queue<uint64_t> q; for (int i = 0; i < N; ++i)
q.push(keys[i]);
volatile uint64_t s = run(q, keys.data(), N, 80000000L);
printf("%llu\n", (unsigned long long)s);
}
```
Build and measure:
g++ -O2 -std=c++23 pq.cpp -o pq_gcc
clang++ -O2 -std=c++23 pq.cpp -o pq_clang
taskset -c 2 setarch -R perf stat -e
cycles:u,instructions:u,branches:u,branch-misses:u ./pq_gcc
taskset -c 2 setarch -R perf stat -e
cycles:u,instructions:u,branches:u,branch-misses:u ./pq_clang
Inspect the hot loop: objdump -d pq_gcc | c++filt | sed -n '/<run/,/ret/p'
Environment: gcc 16.1.1 20260430, clang 22.1.6, x86_64, Tiger Lake (i5-1135G7),
-O2 -std=c++23.
Measured (N=4096, 8e7 push/pop):
build cmov in run() instr:u
cycles:u IPC wall
g++ -O2 8 32.0e9
14.5e9 2.20 3.51s
g++ -O2 -fno-if-conversion -fno-if-conversion2 4 29.6e9
7.5e9 3.94 1.94s
clang++ -O2 1 24.6e9
6.1e9 4.05 1.63s
Expected:
Comparable performance to clang. The data-dependent child-selection should
stay a branch (or the loop should be restructured), since on
unpredictable-but-cheap data a well-predicted branch beats a cmov on the
loop-carried critical path.
Actual:
gcc emits a cmov chain that serializes the heap descent. IPC is ~2.2 vs
clang's ~4.0, i.e. ~2x slower wall time, even though the working set (4096 * 8
B = 32 KB) fits in cache.
Why this is the if-conversion cost model, not misprediction or codegen size:
- Branch-mispredict rate is < 0.1% in all three builds, so this is not a
misprediction problem.
- -fno-if-conversion -fno-if-conversion2 (or --param
max-rtl-if-conversion-insns=0) turns the
child-select back into a branch, roughly halves cycles and nearly doubles
IPC, while leaving the
dynamic instruction count essentially unchanged (32.0e9 -> 29.6e9). The
cost is the cmov on the
critical path, not extra work.
- The gap persists at N=64 (fully L1-resident), so it is not memory-bound;
-O3 and -march=native
do not help (they keep the cmov).
Question:
PR 123017 added a predictability-aware if-conversion cost model for AArch64.
Would extending that model to the i386/x86-64 backend be the right direction
for shapes like this, where a well-predicted branch should win over a cmov on
the loop-carried critical path?
Happy to provide full disassembly, perf counters for more N, or a
self-contained benchmark.
Related:
- PR 56309 - the canonical x86 "conditional moves instead of compare and
branch result in almost 2x slower code".
- PR 82666 - same cost-model gap in a reduction loop (sum += (x>128 ? x :
0)).
- PR 123017 - predictability-aware if-conversion cost model, added for
AArch64.