https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51837

--- Comment #4 from Peter Cordes <pcordes at gmail dot com> ---
It's far from optimal for 32x32 -> 64 in 64-bit mode on old CPUs with slow
64x64-bit multiply, like -mtune=atom, early silvermont-family, or
-mtune=bdver1.  These CPUs are mostly irrelevant now, but code-gen for foo
hasn't changed for -m32 or -m64 since GCC6.1.  On those CPUs, it can save total
uops and give better throughput to use MULL (32-bit operand-size), or
especially MULX if available, so the high half ends up in a register by itself
when that's what the code wants.  (On AMD Bulldozer, imul r64, r64 has 6-cycle
latency and 4c throughput, but mul r32 has 4c latency and 2c throughput, and is
still a single uop unlike on Intel.  The difference is bigger on early Atom.)
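
For example, code that only wants the high half of a 32x32 -> 64 product
(a sketch; the name umulhi32 is just illustrative, not from the testcase):

    #include <stdint.h>

    /* high 32 bits of a 32x32 -> 64 widening multiply */
    static inline uint32_t umulhi32(uint32_t a, uint32_t b)
    {
        return (uint32_t)(((uint64_t)a * b) >> 32);
    }

With MULL the high half is produced directly in EDX; with MULX you can even
choose which registers receive the two halves.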


There are GCC6.5 vs. trunk code-gen differences when returning a struct of 2
values taken from the high halves of two multiplies, as in the Godbolt link in
my previous comment.


Anyway, we currently compile foo(), which does z ^ (z >> 32) on a 64-bit
product, into this asm with GCC -O2 -m64 -march=bdver1:

https://gcc.godbolt.org/z/oqr8q4Ybh
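
For reference, the C for foo is presumably along these lines (a sketch
reconstructed from the description above, not the exact testcase); the asm
below is what GCC makes of it:

    #include <stdint.h>

    uint32_t foo(uint32_t a, uint32_t b)
    {
        uint64_t z = (uint64_t)a * b;   /* 32x32 -> 64 widening multiply */
        return z ^ (z >> 32);           /* XOR of the low and high halves */
    }
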
foo:
        movl    %esi, %esi
        movl    %edi, %edi   # zero-extend both inputs
        imulq   %rsi, %rdi
        movq    %rdi, %rax
        shrq    $32, %rax
        xorl    %edi, %eax
        ret

Not counting the RET, that's 6 instructions, 6 uops on most modern cores.
Critical path latency from either input is 1 + 3 + 0 + 1 + 1 = 6 cycles,
assuming a 3-cycle imul latency like modern CPUs, and mov-elimination for
integer regs like modern CPUs other than Ice Lake.  (The first 2 mov
instructions could be zero latency and no back-end uop if they weren't mov
same,same.)

A pure improvement with no downsides would shorten the critical path to 5
cycles (except on Ice Lake) by letting mov-elim work on every MOV, and taking
the later MOV off the critical path.

## tweaked compiler output changing only register choices
        movl    %esi, %eax   # not same,same so mov-elim can work
        movl    %edi, %ecx   # or use ESI if we don't mind destroying it
        imulq   %rcx, %rax
        movq    %rax, %rcx   # off the critical path even without mov-elim
        shrq    $32, %rcx
        xorl    %ecx, %eax
        ret


#### We could instead be doing:
#### hand-written function, good if inputs aren't already zero-extended
#### or on old CPUs with slow 64-bit multiply

        movl    %esi, %eax
        mull    %edi         # 3 uops on modern Intel
        xorl    %edx, %eax
        ret

This also has a critical-path latency of 5 cycles on CPUs like Skylake where
MUL takes an extra cycle to write the high half.  (https://uops.info/ has
latency breakdowns per input -> output pair, but it's down at the moment.)

But it's only 5 total uops (not counting the RET) instead of 6 on Intel
P-cores, and 4 on Zen-family.
(32-bit widening MUL on Intel P-cores is 3 uops.  64-bit widening MUL is 2
uops.)
And this is much better on older CPUs with slow 64-bit multiply.
On Bulldozer-family, MULL is only 1 uop for the front-end, so extra good there.
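
For anyone who wants this code-gen from C today, inline asm can force it;
a GNU C sketch (the function name is made up, purely illustrative):

    #include <stdint.h>

    static inline uint32_t xor_halves_mull(uint32_t a, uint32_t b)
    {
        uint32_t lo, hi;
        /* one-operand 32-bit widening MUL: EDX:EAX = EAX * src */
        __asm__ ("mull %[b]"
                 : "=a" (lo), "=d" (hi)
                 : [b] "rm" (b), "a" (a)
                 : "cc");
        return lo ^ hi;
    }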

This case is fairly artificial; it's rare that you need to combine low and high
halves of a 64-bit multiply.  If just storing them, we could destroy the low
half with a shift and save 1 MOV in the current code-gen.  Then it's break-even
for front-end cost.
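
Something like this hypothetical store-both-halves caller is the case I mean
(not from the testcase):

    #include <stdint.h>

    void store_halves(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
    {
        uint64_t z = (uint64_t)a * b;
        *lo = (uint32_t)z;            /* store the low half first... */
        *hi = (uint32_t)(z >> 32);    /* ...then shift in place for the high half */
    }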

So this is about equal on modern CPUs, probably not worse on any, and better on
old CPUs.  It's also better for code-size, so the main benefit would be with
-Os.

Also, this is artificial in terms of the IMULQ version needing to zero-extend
both inputs, unlike if it had been inlined into code that loaded or produced
those values.  Then the IMUL version would only be 4 uops.  (Or 5 if neither
multiply input is dead, so we'd need to save one with a MOV, and we don't have
APX for non-destructive IMUL.)
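
For example, a hypothetical caller like this, where the 32-bit loads already
zero-extend into the full 64-bit registers:

    #include <stdint.h>

    uint32_t foo_of_loads(const uint32_t *p)
    {
        /* the loads zero-extend for free, so no separate MOVs are needed */
        uint64_t z = (uint64_t)p[0] * p[1];
        return z ^ (z >> 32);
    }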

So most of the time, using 1-operand 32-bit MULL is only barely worth it; it's
only a big win on old CPUs like -march=bdver3, atom, or silvermont.
