I’m writing to alert the community to my intention to implement improvements to 
the calculations of basic block frequencies during loop unrolling 
optimizations.  My work is sponsored by IBM and is motivated by an observation 
that certain loop optimizations are being confused by the incorrect block 
frequencies that are being reported currently by gcc as a result of existing 
errors in the loop unrolling implementation.

Here’s a very simple program to demonstrate the existing problems:

void foo(double *d, unsigned long int n) {
  unsigned long int i;

  for (i=0;i<n;i++)
      d[i*2] =  0.0;
}

Compile using these options and examine the trace files (or insert your own 
instrumentation) to verify my findings.

        gcc -S -m64 -O3 -fno-tree-vectorize -funroll-loops --param 
max-unroll-times=4 -da loop.c

The trace file loop.c.212r.loop2_invariant displays the intermediate 
representation (rt.) immediately before loop unrolling and the trace file 
loop.c.213r.loop2_unroll shows the intermediate representation immediately 
following loop unrolling.

Before unrolling, we have the following (i’m providing meaningful names to some 
of the compiler-generated temporary variables):

 B0: /* the method prologue */
       /* fall through to B2 (100% probability) */

 B2: /* frequency: 900 */
       d[reg 183] = ap[48];  /* fetch incoming argument into register 183 */
       n[reg 184] = ap[56];
       ivtmp.6[reg 178] = d;
       end_of_d[reg 182] = ivtmp.6 + t;
       if (o < n) goto B3;   /* conditional jump to B3 with probability 91% */
       else goto B4;  /* “fall through" to B4 with probability 9% */

 B3: /* frequency: 819 */
       zero_preload[reg 187] = 0.0d;
       /* fall through to B5 with probability 100% */

        /* aside: the back edge of a loop is assumed to have probability 91%  
This has the
         * effect of treating the body of the loop as if its frequency is the 
incoming edge’s
         * frequency divided by 0.09.
         */
 B5: /* frequency 9100 = 819 (from B3) + 8281 (from B6) */
       mem[ivtmp.6] = zero_preload;
       ivtmp.6 += 16;
       if (ivtmp.6 != end_of_d) goto B6;  /* probability 91%, frequency 8281 = 
91% of 9100 */
       else goto B7;  /* probability 9%, frequency 819 = 9% of 9100 */

 B6: /* frequency 8281 */
       /* no code */
       goto B5;  /* "fall through" with 100% probability, frequency 8281 */

 B7: /* frequency 819 */
       /* no code */
       goto B4;  /* “fall through” with 100% probability, frequency 819 */

 B4: /* frequency 900 = 819 (from B7) + 81 (from B2) */
       /* no code */
       goto B1;  /* “fall through” with 100% probability, frequency 900 */

 B1: /* the method epilogue */

Note that for each basic block with multiple outgoing flows, the sum of the 
probabilities on those outgoing flows always equals 100%.

Also, for any block that has multiple predecessors, the sum of the frequencies 
associated with the incoming edges always equals the frequency of that block.

Here’s what the transformed program looks like following loop unrolling:

B0: /* the method prologue */
      /* fall through to B2 (100% probability) */

 B2: /* frequency 900 */
       _16 = n << 4;
       ivtmp.6 = d;
       _17 = ivtmp.6 + _16;
       if (0 < n) goto B3 /* fall-through, probability = 91%, frequency 819 */
       else goto B4; /* probability 9%, frequency 81 */

 B3: /* frequency 819 */
       zero_preload = 0.0;
       goto B8;  /* fall through, probability 100%, frequency 819 */

 B8: /* frequency 819 */
       d_length = _17 - ivtmp.6;
       extended_length = d_length - 16;
       extended_element_count = extended_length >> 4;
       mod4_elements = extended_element_count & 0x03;
      goto B9;  /* fall through, probability 100%, frequency 819 */

 B9: /* frequency 819 */
       *((double *) ivtmp.6) = zero_preload;
       ivtmp.6 += 16;
       if (ivtmp.6 == _17) goto B10; /* probability 91%, frequency 745 */
      else goto B7; /* fall through, probability 9%, frequency 74 */

 B10: /* frequency 745 */
         /* no code */
         goto BB23;  /* fall through, probability 75%, frequency 559 */
         /* WRONG: probability should be 100%, frequency should be 745 */

 B23: /* frequency 497: WRONG: B10 is the only predecessor, frequency should be 
559 based on existing B10 information */
        if (mod4_elements == 0) goto B22; /* 25% probability, frequency 124 */
        else goto BB19; /* fall through, probability 100%, frequency 497.  
WRONG: should be probability 75%, frequency 373 */

 B19: /* frequency: 373.  INCONSISTENT with B23 data, but this is a correct 
value by some standard. */
         if (mod4_elements == 1) goto BB18; /* 33% probability, frequency 123 */
         else goto B15; /* fall-through, probability 100%, frequency 373.  
WRONG: should be probability 67%, frequency 250 */

 B15: /* frequency: 745.  WRONG: only predecessor is B19, should be 373 as B19 
is currently described */
         if (mod4_elements == 2) goto B14; /* probability 50%, frequency = 373 
*/
         else goto B11; /* fall through, probability 100%, frequency 745.  
WRONG: should be probability 50%, frequency 373 */

 B11: /* frequency 745 */
         /* note: mod4_elements == 3 */
         goto B12;  /* fall through, probability 100%, frequency = 745 */

 B12: /* frequency 745 */
         *((double *) ivtmp.6) = zero_preload;
         ivtmp.6 += 16;
         goto B13;  /* fall through, probability 100%, frequency 745 */

 B13: /* frequency 745 */
          /* no code */
          goto B14; */ fall through, probability 100%, frequency 745 */

 B14: /* frequency 745  WRONG: predecessors are B13 and B15, should be 1118 = 
745 + 373 */
         /* no code */
         goto B16;  /* fall through, probability 100%, frequency 745 */

 B16: /* frequency 745 */
         *((double *) ivtmp.6) = zero_preload;
         ivtmp.6 += 16;
         goto B17; /* fall through, probability 100%, frequency 745 */

 B17: /* frequency 745 */
         goto B18; /* fall through, probability 100%, frequency 745 */

 B18: /* frequency 745. WRONG: predecessors are B19 and B17.  should be 868 = 
123 + 745 */
         goto B20; /* fall through, probability 100%, frequency 745 */

 B20: /* frequency 745 */
         *((double *) ivtmp.6) = zero_preload;
         ivtmp.6 += 16;
         if (ivtmp.6 != _17) goto B21;  /* probability 91%, frequency 678 */
         else goto B7; /* fall through, 9%, frequency 67 */

 B21: /* frequency 678 */
         /* no code */
         goto B22; /* fall through, probability 100%, frequency 678 */

 B22: /* frequency 678. WRONG: should be 802 = 678 (predecessor B21) + 124 
(predecessor B23) */
         goto B5; /* fall through, probability 100%, frequency 678 */

 B24: /* frequency 1884 */
         *((double *) ivtmp.6) = zero_preload;
         ivtmp.6 = unrolled_ivtmp_base + 16;
         goto B25; /* fall through, probability 100%, frequency 1884 */

 B25: /* frequency 1884 */
         /* no code */
         goto B26; /* fall through, probability 100%, frequency 1884 */

 B26: /* frequency 1884 */
         *((double *) ivtmp.6) = zero_preload;
         ivtmp.6 = unrolled_ivtmp_base + 32;
         goto B27;

 B27: /* frequency 1884 */
         /* no code */
         goto B28; /* fall through, probability 100%, frequency 1884 */

 B28: /* frequency 1884 */
         *((double *) ivtmp.6) = zero_preload;
         ivtmp.6 = unrolled_ivtmp_base + 48;
         if (ivtmp.6 != _17) goto B29; /* probability 91%, frequency 1715 */
         else goto B7;  /* fall through, probability 9%, frequency 170 */

 B29: /* frequency 1715 */
         /* no code */
         goto B5; /* fall through, probability 100%, frequency 1715 */

 B5: /* frequency 1884. WRONG: should be 2393 = 678 (predecessor B22) + 1715 
(predecessor B29) */
       /* Note: as represented currently, the frequency of the original loop 
body has been divided by 4 as a consequence of
       * unrolling the loop 4 times.  As part of our planned improvements, we 
intend to not divide the original frequency by
       * the loop unroll factor.  This is because the same heuristic for 
calculating loop body frequencies can be applied both
       * to the original and unrolled loop.  Also, it has been our experience 
that dividing the loop body frequencies by the loop
       * unroll factor causes certain subsequent optimization passes to not 
recognize the unrolled loop body as “hot code”
       * which is deserving of further optimization.
       *
       * When we eventually get around to doing similar improvements for the 
count field associated with basic blocks, we would
       * expect to scale count values according to the loop unroll factor.  
This is because count values are based on actual
       * measurements rather than heuristic approximations.
       */
       *((double *) ivtmp.6) = zero_preload;
       unrolled_ivtmp_base = ivtmp.6 + 16;
       ivtmp.6 = unrolled_ivtmp_base;
       goto B6; /* fall through, probability 100%, frequency 1884 */

 B6: /* frequency 1884 */
       /* no code */
       goto B29; /* fall through, probability 100%, frequency 1884 */

 B7: /* frequency 819. WRONG: should be 311 = 67 (predecessor B20) + 170 
(predecessor B28) + 74 (predecessor B9) */
       /* Note: by a different argument, 819 is the right value, because this 
is the only exit from the loop, and there are 819
        * entries into the loop.
        */
       /* no code */
       goto B4; /* fall through, probability 100%, frequency 819 */

 B4: /* frequency 900 */
       /* no code */
       goto B1; /* fall through, probability 100%, frequency 900 */

 B1: /* method epilogue */

Though this very simple program is sufficient to demonstrate the problems, I 
will be testing my solution with a variety of other test programs to make sure 
I handle distinctions between loops that iterate over a constant number of 
iterations vs. those that iterate over a number of iterations known only at run 
time, and also will look at various conditional control structures embedded 
within the loops, and will also look at an inner-nested loop, and at loops with 
multiple exits.  I’m guessing that loops with multiple entries are not 
unrolled, but I’ll confirm this with testing and make sure that the 
implementation does not explode in the presence of unnatural loops.

At the moment, I am not planning to make any changes to the maintenance of the 
count fields associated with each basic block during loop unrolling.  After the 
frequency fixes are completed, tested, and integrated, I plan to examine 
whether accounting associated with the count fields also deserves attention, 
and will address that separately.

Also, we have noticed that there are some opportunities to improve the code 
generation templates that are used for unrolling loops and may address those 
with a future patch, but I am not planning to make any changes to code 
generation at the current time.

I invite any guidance or suggestions on how to proceed.  I’m also hoping that 
if someone else is already in the middle of addressing these shortcomings, 
you’ll alert me so that we can avoid redundant efforts.

Thanks much.



Reply via email to