https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124654
Bug ID: 124654
Summary: Suboptimal carry chain after mulx in
multiply-accumulate loop
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: chfast at gmail dot com
Target Milestone: ---
In a multiply-accumulate loop (the hot inner loop of bigint multiplication),
GCC generates a suboptimal carry chain after mulx. Where Clang uses a single
`adcq` to combine the multiply high word with both carry flags, GCC uses
`setc + movzbl + addq + addq` — 4 instructions instead of 1.
This causes ~70% more instructions per loop iteration (12 vs 7 in the
unrolled case) and measurable ~2x slowdown in modular exponentiation
benchmarks (256-bit Montgomery multiplication).
Test case (compile with: g++ -O3 -march=x86-64-v3 -S):
typedef unsigned __int128 u128;
typedef unsigned long long u64;
u64 addmul(u64* r, const u64* p, const u64* x, int n, u64 y) {
u64 c = 0;
#pragma GCC unroll 1
for (int i = 0; i < n; ++i) {
u128 t = u128{x[i]} * y;
u64 lo = u64(t), hi = u64(t >> 64);
u64 s = lo + p[i]; bool c1 = s < lo;
u64 s2 = s + c; bool c2 = s2 < s;
r[i] = s2;
c = hi + c1 + c2;
}
return c;
}
GCC 16.0 (trunk) loop body (-O3 -march=x86-64-v3):
.L7:
mov rdx, r8
xor edi, edi
mulx r11, rax, QWORD PTR [rbp+0+r14]
add rax, QWORD PTR [r9+r14]
setc dil # carry1 (zero-extended via xorl
above)
add rax, rsi
setc sil # carry2
mov QWORD PTR [rbx+r14], rax
add r14, 8
movzx esi, sil # zero-extend carry2
add rdi, r11 # c = hi + carry1
add rsi, rdi # c += carry2
cmp r14, rcx
jne .L7
12 instructions per iteration.
`c = hi + carry1 + carry2` takes 4 instructions: setc + movzx + add + add.
Clang 23.0 (trunk) loop body (-O3 -march=x86-64-v3):
.LBB0_4:
mov rdx, r8
mulx rdx, rbx, qword ptr [r9 + 8*r10]
xor eax, eax
add rbx, qword ptr [rsi + 8*r10]
setb al # carry1 (zero-extended via xorl
above)
add rbx, r11
mov qword ptr [rdi + 8*r10], rbx
adc rax, rdx # c = hi + carry1 + carry2 (ONE
instruction)
inc r10
mov r11, rax
cmp rcx, r10
jne .LBB0_4
10 instructions per iteration (could be 9 without the mov rename).
`c = hi + carry1 + carry2` takes 1 instruction: adcq.
The key missed optimization: after the second `addq` (lo += c), the carry flag
already holds carry2. GCC materializes it with `setc + movzbl` and uses `addq`
to combine with hi. Clang recognizes that `adcq %rdx, %rax` (where rax=carry1
from earlier setb) computes `hi + carry1 + CF` in a single instruction.
Godbolt (GCC trunk vs Clang trunk): https://godbolt.org/z/4drshKov9