https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062
Bug ID: 102062
Summary: powerpc suboptimal unrolling simple array sum
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: npiggin at gmail dot com
Target Milestone: ---
Target: powerpc64le-linux-gnu
--- test.c ---
int test(int *arr, int sz)
{
    int ret = 0;
    int i;

    if (sz < 1)
        __builtin_unreachable();

    for (i = 0; i < sz*2; i++)
        ret += arr[i];

    return ret;
}
---
gcc-11 compiles this to:
test:
        rldic 4,4,1,32
        addi 10,3,-4
        rldicl 9,4,63,33
        li 3,0
        mtctr 9
.L2:
        addi 8,10,4
        lwz 9,4(10)
        addi 10,10,8
        lwz 8,4(8)
        add 9,9,3
        add 9,9,8
        extsw 3,9
        bdnz .L2
        blr
I may be unaware of a constraint of the C standard here, but maintaining two
base addresses seems pointless, as does starting the first at offset -4.
The bigger problem is keeping a single sum: the add, add, extsw sequence in
each iteration depends on the result of the previous iteration. Keeping two
sums and adding them at the end reduces the critical-path latency of the loop
from 6 cycles to 2, which brings throughput on large loops from 6 cycles per
iteration down to about 2.2 on POWER9 without harming short loops:
test:
        rldic 4,4,1,32
        rldicl 9,4,63,33
        mtctr 9
        li 8,0
        li 9,0
.L2:
        lwz 6,0(3)
        lwz 7,4(3)
        addi 3,3,8
        add 8,8,6
        add 9,9,7
        bdnz .L2
        add 9,9,8
        extsw 3,9
        blr
Any reason this can't be done?
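For reference, the hand-written assembly above corresponds roughly to the
following source-level rewrite. This is only a sketch: the name test_two_sums
is made up for illustration, and accumulating in unsigned int is my own choice,
made so that splitting the sum cannot introduce a signed overflow that the
original summation order would not have had. The final conversion back to int
relies on GCC's documented modulo behaviour for out-of-range values.

```c
/* Hypothetical two-accumulator variant of test(); not compiler output. */
int test_two_sums(int *arr, int sz)
{
    /* Unsigned accumulators wrap instead of overflowing (assumption:
     * this is how the reassociation is kept free of new UB). */
    unsigned int sum0 = 0;
    unsigned int sum1 = 0;
    int i;

    if (sz < 1)
        __builtin_unreachable();

    /* sz*2 is always even, so arr[i + 1] stays within the summed range. */
    for (i = 0; i < sz*2; i += 2) {
        sum0 += (unsigned int)arr[i];      /* even elements */
        sum1 += (unsigned int)arr[i + 1];  /* odd elements */
    }

    /* Combine the two independent chains once, after the loop. */
    return (int)(sum0 + sum1);
}
```

Each accumulator carries only one add per iteration, and the sign extension
is needed only once, on the combined result after the loop.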