https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124654

            Bug ID: 124654
           Summary: Suboptimal carry chain after mulx in
                    multiply-accumulate loop
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chfast at gmail dot com
  Target Milestone: ---

In a multiply-accumulate loop (the hot inner loop of bigint multiplication),
GCC generates a suboptimal carry chain after mulx. Where Clang uses a single
`adcq` to combine the multiply's high word with both carries, GCC uses
`setc + movzbl + addq + addq` — 4 instructions instead of 1.

This causes ~70% more instructions per loop iteration (12 vs 7 in the
unrolled case) and a measurable ~2x slowdown in modular-exponentiation
benchmarks (256-bit Montgomery multiplication).

Test case (compile with: g++ -O3 -march=x86-64-v3 -S):

    typedef unsigned __int128 u128;
    typedef unsigned long long u64;

    u64 addmul(u64* r, const u64* p, const u64* x, int n, u64 y) {
        u64 c = 0;
    #pragma GCC unroll 1
        for (int i = 0; i < n; ++i) {
            u128 t = u128{x[i]} * y;
            u64 lo = u64(t), hi = u64(t >> 64);
            u64 s = lo + p[i]; bool c1 = s < lo;
            u64 s2 = s + c;    bool c2 = s2 < s;
            r[i] = s2;
            c = hi + c1 + c2;
        }
        return c;
    }
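
For comparison, the same loop can be written with the `_addcarry_u64` intrinsic (a sketch of my own, not part of the report; `addmul_adc` is a hypothetical name). It computes the same result as the test case above, and may be useful for checking whether GCC maps the intrinsic form to an adc-based chain:

```cpp
#include <x86intrin.h>  // _addcarry_u64 (x86-64 only)

typedef unsigned __int128 u128;
typedef unsigned long long u64;

// Same semantics as addmul() above, with the two carried additions
// expressed via _addcarry_u64 instead of manual "sum < addend" tests.
u64 addmul_adc(u64* r, const u64* p, const u64* x, int n, u64 y) {
    u64 c = 0;
    for (int i = 0; i < n; ++i) {
        u128 t = u128{x[i]} * y;
        u64 lo = u64(t), hi = u64(t >> 64);
        u64 s;
        unsigned char c1 = _addcarry_u64(0, lo, p[i], &s);  // s = lo + p[i]
        unsigned char c2 = _addcarry_u64(0, s, c, &s);      // s += c
        r[i] = s;
        c = hi + c1 + c2;  // x*y + p + c fits in 128 bits, so no wrap
    }
    return c;
}
```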

GCC 16.0 (trunk) loop body (-O3 -march=x86-64-v3):

    .L7:
        mov     rdx, r8
        xor     edi, edi
        mulx    r11, rax, QWORD PTR [rbp+0+r14]
        add     rax, QWORD PTR [r9+r14]
        setc    dil                      # carry1 (zero-extended via xorl above)
        add     rax, rsi
        setc    sil                      # carry2
        mov     QWORD PTR [rbx+r14], rax
        add     r14, 8
        movzx   esi, sil                 # zero-extend carry2
        add     rdi, r11                 # c = hi + carry1
        add     rsi, rdi                 # c += carry2
        cmp     r14, rcx
        jne     .L7

    12 instructions per iteration (not counting the cmp/jne loop control).
    `c = hi + carry1 + carry2` takes 4 instructions: setc + movzx + add + add.

Clang 23.0 (trunk) loop body (-O3 -march=x86-64-v3):

    .LBB0_4:
        mov     rdx, r8
        mulx    rdx, rbx, qword ptr [r9 + 8*r10]
        xor     eax, eax
        add     rbx, qword ptr [rsi + 8*r10]
        setb    al                       # carry1 (zero-extended via xorl above)
        add     rbx, r11
        mov     qword ptr [rdi + 8*r10], rbx
        adc     rax, rdx                 # c = hi + carry1 + carry2 (ONE instruction)
        inc     r10
        mov     r11, rax
        cmp     rcx, r10
        jne     .LBB0_4

    10 instructions per iteration, not counting cmp/jne (could be 9 without
    the mov rename).
    `c = hi + carry1 + carry2` takes 1 instruction: adcq.

The key missed optimization: after the second `addq` (lo += c), the carry flag
already holds carry2. GCC materializes it with `setc + movzbl` and uses `addq`
to combine with hi. Clang recognizes that `adcq %rdx, %rax` (where rax=carry1
from earlier setb) computes `hi + carry1 + CF` in a single instruction.
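
The single-adc fold is legal because the per-step accumulator can never overflow: even at the extremes, x[i]*y + p[i] + c <= (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, so `hi + carry1 + CF` always fits in 64 bits. A quick check of the worst case (my addition, not from the report; `worst_case_carry` is a hypothetical name):

```cpp
typedef unsigned __int128 u128;
typedef unsigned long long u64;

// Worst case of one loop step: x[i] = y = p[i] = c = 2^64 - 1.
// The full value x*y + p + c equals (2^64-1)*(2^64+1) = 2^128 - 1,
// so the outgoing carry (the high word) is 2^64 - 1: it fits in u64
// and the adc-based combine cannot wrap.
u64 worst_case_carry() {
    u64 m = ~0ULL;                      // 2^64 - 1
    u128 total = u128{m} * m + m + m;   // x*y + p + c at the extreme
    return u64(total >> 64);            // high word = outgoing c
}
```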

Godbolt (GCC trunk vs Clang trunk): https://godbolt.org/z/4drshKov9
