Hi!

On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
> 
> Currently, of the ten 64-bit architectures Linux supports, that's
> only enabled on x86, ARM, and RISC-V.
> 
> SPARC and HP-PA don't have support.
> 
> But that leaves Alpha, Mips, PowerPC, and S/390x.
> 
> Current mips64, powerpc64, and s390x gcc seems to generate sensible code
> for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.

Yup.
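
For anyone following along, here is a minimal userspace sketch of what the __int128 flavour of that helper boils down to. The name mul_u64_u64_shr_sketch and the use of <stdint.h> types are my own for illustration; this is not a copy of the kernel header, and it assumes a compiler that provides unsigned __int128 and a shift count below 128.

```c
#include <stdint.h>

/*
 * Illustrative stand-in for an __int128-based mul_u64_u64_shr().
 * Assumptions: the compiler supports unsigned __int128 (GCC and Clang
 * do on 64-bit targets) and shift is in the range 0..127.
 */
static inline uint64_t mul_u64_u64_shr_sketch(uint64_t a, uint64_t mul,
                                              unsigned int shift)
{
	/* Widen first so the full 128-bit product is kept, then shift. */
	return (uint64_t)(((unsigned __int128)a * mul) >> shift);
}
```

With shift == 64 this returns just the high half of the product, which is the case the instruction sequences quoted below are about.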

> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.

https://mirrors.edge.kernel.org/pub/tools/crosstool/

> u64 get_random_u64(void);
> u64 get_random_max64(u64 range, u64 lim)
> {
>       unsigned __int128 prod;
>       do {
>               prod = (unsigned __int128)get_random_u64() * range;
>       } while (unlikely((u64)prod < lim));
>       return prod >> 64;
> }

> Which turns into these inner loops:
> MIPS:
> .L7:
>       jal     get_random_u64
>       nop
>       dmultu $2,$17
>       mflo    $3
>       sltu    $4,$3,$16
>       bne     $4,$0,.L7
>       mfhi    $2
> 
> PowerPC:
> .L9:
>       bl get_random_u64
>       nop
>       mulld 9,3,31
>       mulhdu 3,3,31
>       cmpld 7,30,9
>       bgt 7,.L9
> 
> s/390:
> .L13:
>       brasl   %r14,get_random_u64@PLT
>       lgr     %r5,%r2
>       mlgr    %r4,%r10
>       lgr     %r2,%r4
>       clgr    %r11,%r5
>       jh      .L13
> 
> I like that the MIPS code leaves the high half of the product in
> the hi register until it tests the low half; I wish PowerPC would
> similarly move the mulhdu *after* the loop,

The MIPS code has the multiplication inside the loop as well, and I think
even the mfhi: it sits in the delay slot of the bne, so it is executed on
every iteration too.

GCC treats the int128 as one register until it has expanded to RTL, and it
does not do such loop optimisations after that, apparently.
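
Until the compiler does this by itself, the split can be written by hand. The following is only a userspace sketch under stated assumptions, not kernel code: get_random_max64_split and mul_hi_u64 are names made up for illustration, and the xorshift generator is merely a stand-in so that the example is self-contained (it is not a secure RNG). The loop tests only the low 64 bits of the product, which an ordinary 64-bit multiply gives you, and the high half is computed once after the loop. Whether that is a win depends on the target: it can help where low and high halves are separate instructions, as on PowerPC, and probably not where one instruction produces both.

```c
#include <stdint.h>

/* Stand-in for get_random_u64(): xorshift64, for illustration only. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ull;

static uint64_t get_random_u64(void)
{
	rng_state ^= rng_state << 13;
	rng_state ^= rng_state >> 7;
	rng_state ^= rng_state << 17;
	return rng_state;
}

/* High 64 bits of the 128-bit product (hypothetical helper). */
static uint64_t mul_hi_u64(uint64_t a, uint64_t b)
{
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Same result as the quoted function, with the halves split by hand. */
static uint64_t get_random_max64_split(uint64_t range, uint64_t lim)
{
	uint64_t r;

	do {
		r = get_random_u64();
		/* r * range wraps modulo 2^64, i.e. it is the low half. */
	} while ((uint64_t)(r * range) < lim);

	/* Only now compute the high half, once. */
	return mul_hi_u64(r, range);
}
```

A caller that wants an unbiased value in [0, range) would typically pass lim = -range % range, that is, 2^64 mod range; that choice is my assumption about the intended use, the quoted code does not say.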

File a PR please?  https://gcc.gnu.org/bugzilla/


Segher
