(Cross-posted in case there are generic issues; please trim if
discussion wanders into single-architecture details.)

I was working on some scaling code that can benefit from 64x64->128-bit
multiplies.  GCC supports an __int128 type on processors with hardware
support (including z/Arch and MIPS64), but the support was broken on
early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.

Currently, of the ten 64-bit architectures Linux supports, that's
only enabled on x86-64, arm64, and 64-bit RISC-V.

SPARC and HP-PA don't have hardware support for the wide multiply.

But that leaves Alpha, MIPS, PowerPC, and s390x.

Current mips64, powerpc64, and s390x gccs seem to generate sensible code
for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.

I don't have easy access to an Alpha cross-compiler to test, but
as it has UMULH, I suspect it would work, too.

Is there a reason it hasn't been enabled on these platforms?

There might be a MIPS64r6 issue, since r6 changed from DMULTU
writing the lo and hi registers to DMULU/DMUHU, and gcc 8.3, at
least, doesn't know how to generate inline code for the latter.

(Note that users *also* check __SIZEOF_INT128__, which is defined if GCC
claims to support __int128, so you don't have to worry about 32-bit
compiles or ancient compilers.  It only has to be conditional on
*broken* support.)
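
To make the double check concrete, here is a rough userspace sketch of
the usual guard pattern.  The helper names mul_u64_hi() and
mul_u64_hi_portable() are made up for illustration, and the #define of
CONFIG_ARCH_SUPPORTS_INT128 only stands in for what Kconfig would
provide; none of this is copied from the kernel tree.

```c
#include <stdint.h>

/* Assumption: stand-in for the Kconfig symbol, defined by hand here. */
#define CONFIG_ARCH_SUPPORTS_INT128 1

/* Hypothetical fallback: high 64 bits via four 32x32->64 partial products. */
static inline uint64_t mul_u64_hi_portable(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t ll = a_lo * b_lo;
	uint64_t lh = a_lo * b_hi;
	uint64_t hl = a_hi * b_lo;
	uint64_t hh = a_hi * b_hi;
	/* Sum of the three terms that land in bits 32..63; cannot overflow. */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* Hypothetical helper: use __int128 only when both checks pass. */
static inline uint64_t mul_u64_hi(uint64_t a, uint64_t b)
{
#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
#else
	return mul_u64_hi_portable(a, b);
#endif
}
```

The point is just that the arch symbol only needs to veto known-broken
compilers; __SIZEOF_INT128__ already covers the "not supported at all"
case.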


FWIW, the code I'm working on has this inner loop:
(https://arxiv.org/abs/1805.10941 for details)

u64 get_random_u64(void);
u64 get_random_max64(u64 range, u64 lim)
{
        unsigned __int128 prod;
        do {
                prod = (unsigned __int128)get_random_u64() * range;
        } while (unlikely((u64)prod < lim));
        return prod >> 64;
}

Which turns into these inner loops:
MIPS:
.L7:
        jal     get_random_u64
        nop
        dmultu $2,$17
        mflo    $3
        sltu    $4,$3,$16
        bne     $4,$0,.L7
        mfhi    $2

PowerPC:
.L9:
        bl get_random_u64
        nop
        mulld 9,3,31
        mulhdu 3,3,31
        cmpld 7,30,9
        bgt 7,.L9

s/390:
.L13:
        brasl   %r14,get_random_u64@PLT
        lgr     %r5,%r2
        mlgr    %r4,%r10
        lgr     %r2,%r4
        clgr    %r11,%r5
        jh      .L13

I like that the MIPS code leaves the high half of the product in
the hi register until it tests the low half; I wish PowerPC would
similarly move the mulhdu *after* the loop, like the following
hypothetical MIPS R6 code:

.L7:
        balc    get_random_u64
        dmulu   $3, $2, $17
        sltu    $3, $3, $16
        bnezc   $3, .L7
        dmuhu   $2, $2, $17

Or this handwritten Alpha code:
1:
        bsr     $26, get_random_u64
        mulq    $0, $9, $1      # $9 is range
        cmpult  $1, $10, $1     # $10 is lim
        bne     $1, 1b
        umulh   $0, $9, $0
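
For anyone who wants to poke at the C loop above outside the kernel,
here is a minimal userspace sketch.  Everything prefixed demo_ is
hypothetical: demo_random_u64() is a splitmix64 stand-in for
get_random_u64() (not cryptographically secure), and demo_lim()
assumes the caller passes lim = 2^64 mod range, which is how I read
the rejection threshold in the paper linked above.

```c
#include <stdint.h>

/* Hypothetical stand-in for get_random_u64(): splitmix64, NOT secure. */
static uint64_t demo_state = 0x9e3779b97f4a7c15ULL;

static uint64_t demo_random_u64(void)
{
	uint64_t z = (demo_state += 0x9e3779b97f4a7c15ULL);

	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
	z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
	return z ^ (z >> 31);
}

/* Assumed threshold: 2^64 mod range, computed without 128-bit math.
 * range must be nonzero. */
static uint64_t demo_lim(uint64_t range)
{
	return (0 - range) % range;
}

/* Same loop as in the mail, minus unlikely(), returning a value in [0, range). */
static uint64_t demo_random_max64(uint64_t range, uint64_t lim)
{
	unsigned __int128 prod;

	do {
		prod = (unsigned __int128)demo_random_u64() * range;
	} while ((uint64_t)prod < lim);	/* reject to remove bias */
	return (uint64_t)(prod >> 64);
}
```

Typical use would be demo_random_max64(range, demo_lim(range)); the
low half of the product is only needed for the rejection test, which
is why deferring the high-half multiply until after the loop is
attractive.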
