https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106678
Bug ID: 106678
Summary: Inefficiency in long integer multiplication
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tkoenig at gcc dot gnu.org
Target Milestone: ---
The code from PR 103109
#include <stdint.h>
void Long_multiplication( uint64_t multiplicand[],
uint64_t multiplier[],
uint64_t sum[],
uint64_t ilength, uint64_t jlength )
{
uint64_t acarry, mcarry, product;
for( uint64_t i = 0;
i < (ilength + jlength);
i++ )
sum[i] = 0;
acarry = 0;
for( uint64_t j = 0; j < jlength; j++ )
{
mcarry = 0;
for( uint64_t i = 0; i < ilength; i++ )
{
__uint128_t mcarry_prod;
__uint128_t acarry_sum;
mcarry_prod = ((__uint128_t) multiplicand[i]) * ((__uint128_t)
multiplier[j])
+ (__uint128_t) mcarry;
mcarry = mcarry_prod >> 64;
product = mcarry_prod;
acarry_sum = ((__uint128_t) sum[i+j]) + ((__uint128_t) acarry) +
product;
sum[i+j] += acarry_sum;
acarry = acarry_sum >> 64;
// {mcarry, product} = multiplicand[i]*multiplier[j]
// + mcarry;
// {acarry,sum[i+j]} = {sum[i+j]+acarry} + product;
}
}
}
still shows some inefficiency after r13-2107.
Compiling the function with gcc 13.0.0 20220818, with
$ gcc -mcpu=power9 -O3 -c loop.c
and disassembling the output (for easier reading) gives (looking only
at the main part)
7c: 00 00 80 38 li r4,0
80: 00 00 80 3b li r28,0
84: 00 00 60 38 li r3,0
88: 00 00 00 38 li r0,0
8c: ff ff c0 38 li r6,-1
90: 00 00 e0 38 li r7,0
94: 20 00 c1 fa std r22,32(r1)
98: 28 00 e1 fa std r23,40(r1)
9c: 60 00 c1 fb std r30,96(r1)
a0: 68 00 e1 fb std r31,104(r1)
a4: 00 00 00 60 nop
a8: 00 00 00 60 nop
ac: 00 00 42 60 ori r2,r2,0
b0: a6 03 49 7f mtctr r26
b4: 78 c3 0c 7f mr r12,r24
b8: 14 22 b9 7c add r5,r25,r4
bc: 00 00 00 39 li r8,0
c0: 09 00 6c e9 ldu r11,8(r12)
c4: 2a 20 5d 7d ldx r10,r29,r4
c8: 09 00 25 e9 ldu r9,8(r5)
cc: 33 52 cb 13 maddld r30,r11,r10,r8
d0: 31 52 eb 13 maddhdu r31,r11,r10,r8
d4: 38 30 d6 7f and r22,r30,r6
d8: 38 38 f7 7f and r23,r31,r7
dc: 78 fb e8 7f mr r8,r31
e0: 14 48 56 7d addc r10,r22,r9
e4: 14 01 77 7d adde r11,r23,r0
e8: 14 18 4a 7d addc r10,r10,r3
ec: 14 52 29 7d add r9,r9,r10
f0: 94 01 6b 7c addze r3,r11
f4: 00 00 25 f9 std r9,0(r5)
f8: c8 ff 00 42 bdnz c0 <Long_multiplication+0xc0>
fc: 01 00 9c 3b addi r28,r28,1
100: 08 00 84 38 addi r4,r4,8
104: 40 e0 3b 7c cmpld r27,r28
108: a8 ff 82 40 bne b0 <Long_multiplication+0xb0>
In these two nested loops, r6 is not changed, so it is always -1.
d4: 38 30 d6 7f and r22,r30,r6
just assigns r30 to r22, so r30 could have been used instead of
r22.
Similarly,
d8: 38 38 f7 7f and r23,r31,r7
just sets r23 to zero because r7 is always zero.