http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50101
Bug #: 50101
Summary: GCC 4.5 and 4.6 generate suboptimal code on ppc for
countdown loops when the CTR register cannot be used
Classification: Unclassified
Product: gcc
Version: 4.6.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: [email protected]
ReportedBy: [email protected]
Host: powerpc64-linux
Target: powerpc64-linux
Build: powerpc64-linux
When GCC switched over to the IRA register allocator in GCC 4.5, some loops
became slower on PowerPC. In particular, PowerPC has a count register (CTR)
that the compiler can use for loop control with the -fbranch-count-reg
optimization. However, if the CTR register is not available in the loop, the
compiler does not keep the loop counter in a GPR. Instead, on every iteration
it loads the counter from the stack, decrements it, and stores it back.
For example, in the code:
int code[65536];
mike()
{
  int j;
  long addr;
  for (j = 0; j < 65536; j+=4) {
    asm("mtctr %1" : "=c" (addr) : "r" (&code[j]));
    asm("bctrl" : : "c" (addr) : "lr" );
  }
}
It generates the following on 4.3 (SLES 11 SP1 host compiler):
.L.mike:
    mflr 0
    ld 9,.LC0@toc(2)
    li 11,16384
    std 0,16(1)
    .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
    mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
    bctrl
 # 0 "" 2
#NO_APP
    addic. 11,11,-1
    addi 9,9,16
    bne 0,.L2
    ld 0,16(1)
    mtlr 0
    blr
With a 4.4-based compiler, such as the RHEL6 host compiler, I get:
.L.mike:
    mflr 0
    ld 9,.LC0@toc(2)
    std 0,16(1)
    li 0,16384
    std 0,-16(1)
    .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
    mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
    bctrl
 # 0 "" 2
#NO_APP
    ld 0,-16(1)
    addi 9,9,16
    addic. 11,0,-1
    std 11,-16(1)
    bne 0,.L2
    ld 0,16(1)
    mtlr 0
    blr
Notice that it loads and stores the loop counter on every iteration. If I use
-fno-branch-count-reg, it generates code that uses the GPRs:
.L.mike:
    mflr 0
    ld 9,.LC0@toc(2)
    std 0,16(1)
    addis 0,9,0x4
    .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
    mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
    bctrl
 # 0 "" 2
#NO_APP
    addi 9,9,16
    cmpd 7,9,0
    bne 7,.L2
    ld 0,16(1)
    mtlr 0
    blr
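For readers less familiar with PowerPC assembly, the two loop shapes in the
listings above can be sketched in portable C. This is only an illustration,
not compiler output: the names visit, countdown_form and pointer_compare_form
are made up here, visit stands in for the mtctr/bctrl pair, and the byte
offsets in the comments assume a 4-byte int.

```c
#define N 65536

static int code[N];

/* Hypothetical stand-in for the mtctr/bctrl pair: it only counts visits. */
static long visits;
static void visit(int *p) { (void)p; visits++; }

/* Count-register shape: a separate down-counter (65536 / 4 = 16384)
   runs next to the address induction variable. */
static long countdown_form(void)
{
    int *p = code;
    long n = N / 4;          /* li 11,16384 */
    visits = 0;
    do {
        visit(p);
        p += 4;              /* addi 9,9,16 */
    } while (--n != 0);      /* addic. 11,11,-1 ; bne */
    return visits;
}

/* -fno-branch-count-reg shape: no counter at all, the address is
   compared against a precomputed end pointer. */
static long pointer_compare_form(void)
{
    int *p = code;
    int *end = code + N;     /* addis 0,9,0x4: base + 0x40000 bytes */
    visits = 0;
    do {
        visit(p);
        p += 4;              /* addi 9,9,16 */
    } while (p != end);      /* cmpd 7,9,0 ; bne */
    return visits;
}
```

Both functions run the body 16384 times. The regression described here
affects only the first shape: the 4.4-based compiler keeps the counter n in a
stack slot and reloads it on every iteration instead of holding it in a GPR.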
This is fixed in the GCC 4.7 development sources. The change that fixed it was
Subversion revision 171649, committed on March 28th, 2011 by Vladimir Makarov
<[email protected]> as part of his large rewrite of the IRA register
allocator.
As an experiment, I built the SPEC 2006 benchmark suite with
-fno-branch-count-reg. As expected, a number of benchmarks regress when the
count register optimization is disabled, but a few benchmarks get a large
speedup from disabling it, which probably indicates that they are being
mis-optimized. The benchmarks with the speedup include:
464.h264ref (19.65% improvement), 434.zeusmp (17.92% improvement) and
459.GemsFDTD (13.02% improvement).