http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50101

             Bug #: 50101
           Summary: GCC 4.5 and 4.6 generate suboptimal code on ppc for
                    countdown loops when the CTR register cannot be used
    Classification: Unclassified
           Product: gcc
           Version: 4.6.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: [email protected]
        ReportedBy: [email protected]
              Host: powerpc64-linux
            Target: powerpc64-linux
             Build: powerpc64-linux


When GCC switched over to the IRA register allocator in GCC 4.5, some loops
began to run slower on PowerPC.  In particular, PowerPC has a count register
(CTR) that the compiler can use for countdown loops with the
-fbranch-count-reg optimization.  However, if the CTR is not available in the
loop, the compiler does not keep the loop counter in a GPR; instead, on every
iteration it loads the counter from the stack, decrements it, and stores it
back.

For example, in the code:

int code[65536];

void mike(void)
{
  int j;
  long addr;

  for (j = 0; j < 65536; j+=4) {
    asm("mtctr %1" : "=c" (addr) : "r" (&code[j]));
    asm("bctrl" : : "c" (addr) : "lr" );
  }
}

GCC 4.3 (the SLES 11 SP1 host compiler) generates the following:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        li 11,16384
        std 0,16(1)
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        addic. 11,11,-1
        addi 9,9,16
        bne 0,.L2
        ld 0,16(1)
        mtlr 0
        blr
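
To relate the listing above to the source loop, here is a rough C sketch of
the transformation the compiler performs: the count-up loop over j becomes a
countdown over the trip count (the "li 11,16384" / "addic. 11,11,-1" pair)
plus a pointer that advances by a fixed byte stride ("addi 9,9,16").  The
names trip_count, walk_count_up and walk_count_down are hypothetical and not
part of the test case, and the 16-byte stride assumes a 4-byte int, as on
powerpc64-linux.

```c
#include <stddef.h>

enum { N = 65536, STEP = 4 };   /* bounds taken from the test case */

/* Number of iterations of: for (j = 0; j < n; j += step) */
static long trip_count(long n, long step)
{
    return (n + step - 1) / step;
}

/* Count-up form, as written in the test case.
   Returns the byte offset of the last element visited. */
static long walk_count_up(void)
{
    long last = -1;
    for (int j = 0; j < N; j += STEP)
        last = (long)j * (long)sizeof(int);
    return last;
}

/* Countdown form, mirroring the 4.3 assembly: a counter that is
   decremented to zero and a byte offset bumped by a fixed stride. */
static long walk_count_down(void)
{
    long remaining = trip_count(N, STEP);   /* li 11,16384 */
    long offset = 0;
    long last = -1;
    do {
        last = offset;
        offset += STEP * (long)sizeof(int); /* addi 9,9,16 */
        remaining -= 1;                     /* addic. 11,11,-1 */
    } while (remaining != 0);               /* bne 0,.L2 */
    return last;
}
```

Both forms visit the same addresses; the only question for the register
allocator is where the countdown value lives between iterations.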

With a 4.4-based compiler such as the RHEL6 host compiler, I get:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        std 0,16(1)
        li 0,16384
        std 0,-16(1)
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        ld 0,-16(1)
        addi 9,9,16
        addic. 11,0,-1
        std 11,-16(1)
        bne 0,.L2
        ld 0,16(1)
        mtlr 0
        blr

Notice that it loads and stores the loop counter at -16(1) on every
iteration.  If I use -fno-branch-count-reg, it generates code that keeps the
loop state in GPRs:

.L.mike:
        mflr 0
        ld 9,.LC0@toc(2)
        std 0,16(1)
        addis 0,9,0x4
        .p2align 4,,15
.L2:
#APP
 # 10 "test-ppc-ctr.c" 1
        mtctr 9
 # 0 "" 2
 # 11 "test-ppc-ctr.c" 1
        bctrl
 # 0 "" 2
#NO_APP
        addi 9,9,16
        cmpd 7,9,0
        bne 7,.L2
        ld 0,16(1)
        mtlr 0
        blr
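
As a possible narrower workaround than building a whole file with
-fno-branch-count-reg, GCC's "optimize" function attribute accepts option
names without the -f prefix, so the option can be turned off for one function.
The sketch below is only an illustration: sum_stride4 is a hypothetical
function, not taken from the test case, and whether the attribute actually
avoids the spill on an affected compiler has not been verified here.  The GCC
manual also recommends this attribute for debugging rather than production
use.

```c
#include <stddef.h>

/* Hypothetical example function: sums every fourth element.
   The attribute requests -fno-branch-count-reg for this function only
   (GCC-specific; other compilers may ignore it with a warning). */
__attribute__((optimize("no-branch-count-reg")))
long sum_stride4(const int *data, size_t n)
{
    long total = 0;
    for (size_t j = 0; j < n; j += 4)
        total += data[j];
    return total;
}
```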

This is fixed in the GCC 4.7 development sources.  The revision that fixed it
was subversion id 171649, created on March 28th, 2011 by Vladimir Makarov
<[email protected]>, as part of his large rewrite of the IRA register
allocator.

As an experiment, I built the SPEC 2006 benchmark suite with
-fno-branch-count-reg.  As expected, a number of benchmarks regress when the
count register optimization is disabled, but a few benchmarks get a large
speedup from disabling it, which probably indicates that they are being
mis-optimized.  The benchmarks with the speedup include:
464.h264ref (19.65% improvement), 434.zeusmp (17.92% improvement) and
459.GemsFDTD (13.02% improvement).
