Using gcc 4.4.0, I compiled a function and found the generated code a tad
long. The command line is:

gcc -Os -mcpu=arm7tdmi-s -S func.c

although the output is pretty much the same with -O2 or -O3 as well (only
a few instructions longer).

The function is basically an unrolled 32-bit unsigned division by 1E9:

unsigned int divby1e9( unsigned int num, unsigned int *quotient )
{
  unsigned int dig;
  unsigned int tmp;
  tmp = 1000000000u;
  dig = 0;
  if ( num >= tmp ) {
      tmp <<= 2;                  /* 4e9 still fits in 32 bits */
      if ( num >= tmp ) {
          num -= tmp;
          dig  = 4;
      }
      else {
          tmp >>= 1;              /* 2e9 */
          if ( num >= tmp ) {
              num -= tmp;
              dig  = 2;
          }
          tmp >>= 1;              /* 1e9 */
          if ( num >= tmp ) {
              num -= tmp;
              dig++;
          }
      }
  }
  *quotient = dig;
  return num;                     /* remainder */
}
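
As a sanity check of the algorithm (not part of the build above), the
function can be compared against the built-in / and % operators. This is
only a sketch: it assumes uint32_t in place of unsigned int, and
agrees_with_builtin is a made-up helper name.

```c
#include <stdint.h>

/* Same algorithm as above, written with fixed-width types. */
uint32_t divby1e9(uint32_t num, uint32_t *quotient)
{
    uint32_t tmp = 1000000000u;
    uint32_t dig = 0;

    if (num >= tmp) {
        tmp <<= 2;                      /* 4e9 */
        if (num >= tmp) {
            num -= tmp;
            dig = 4;
        } else {
            tmp >>= 1;                  /* 2e9 */
            if (num >= tmp) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;                  /* 1e9 */
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}

/* Hypothetical helper: returns 1 if divby1e9 matches / and % for num. */
int agrees_with_builtin(uint32_t num)
{
    uint32_t q;
    uint32_t r = divby1e9(num, &q);

    return q == num / 1000000000u && r == num % 1000000000u;
}
```

Calling agrees_with_builtin on the boundary values (1e9 - 1, 1e9, 2e9,
4e9, 0xFFFFFFFF and so on) exercises every branch of the function.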

The compiler generated the following code:

divby1e9:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L10
        cmp     r0, r3
        movls   r3, #0
        bls     .L3
        ldr     r2, .L10+4
        cmp     r0, r2
        addhi   r0, r0, #293601280
        addhi   r0, r0, #1359872
        addhi   r0, r0, #6144
        movhi   r3, #4
        bhi     .L3
.L4:
        ldr     r2, .L10+8
        cmp     r0, r2
        movls   r3, #0
        bls     .L6
        add     r0, r0, #-2013265920
        add     r0, r0, #13238272
        add     r0, r0, #27648
        cmp     r0, r3
        movls   r3, #2
        bls     .L3
        mov     r3, #2
.L6:
        add     r0, r0, #-1006632960
        add     r0, r0, #6619136
        add     r0, r0, #13824
        add     r3, r3, #1
.L3:
        str     r3, [r1, #0]
        bx      lr
.L11:
        .align  2
.L10:
        .word   999999999
        .word   -294967297
        .word   1999999999


Note that it is sub-optimal on two counts.

First, each constant that is built inline takes 3 instructions and 3
clocks. Storing the constant in the literal pool and fetching it with an
ldr also takes 3 clocks, but it occupies only two 32-bit words, and
identical constants need to be stored only once. The timing argument
applies only to the ARM7TDMI-S, which has no caches, so that is just a
minor issue, but the memory saving holds no matter what ARM core you have
(note that -Os was specified).
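
For reference, here is where the three-instruction sequences come from.
An ARM data-processing instruction can only encode an 8-bit immediate
rotated right by an even amount, so subtracting 4e9, 2e9 or 1e9 is emitted
as three adds whose immediates sum to the negated constant modulo 2^32.
The sketch below checks that arithmetic for the immediates in the listing;
sum3 and is_arm_immediate are made-up helper names.

```c
#include <stdint.h>

/* Sum of the three immediates of one add sequence, modulo 2^32. */
uint32_t sum3(uint32_t a, uint32_t b, uint32_t c)
{
    return a + b + c;
}

/* Returns 1 if v is an 8-bit value rotated right by an even amount,
   i.e. encodable as a single ARM data-processing immediate. */
int is_arm_immediate(uint32_t v)
{
    unsigned rot;

    for (rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate-right by rot. */
        uint32_t r = rot ? (v << rot) | (v >> (32 - rot)) : v;
        if (r <= 0xFFu)
            return 1;
    }
    return 0;
}

/* Returns 1 if the immediates from the listing add up as claimed. */
int sequences_match(void)
{
    /* 0x88000000 and 0xC4000000 are -2013265920 and -1006632960
       reinterpreted as unsigned 32-bit values. */
    return sum3(293601280u, 1359872u, 6144u)  == 0u - 4000000000u
        && sum3(0x88000000u, 13238272u, 27648u) == 0u - 2000000000u
        && sum3(0xC4000000u, 6619136u, 13824u)  == 0u - 1000000000u;
}
```

The literal pool follows the same pattern: 999999999, -294967297 (that is,
3999999999 unsigned) and 1999999999 are 1e9 - 1, 4e9 - 1 and 2e9 - 1, so
each cmp followed by hi/ls conditions implements a >= test.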

Second, and this is the real problem, if the compiler did not try to be
overly clever and compiled the code as it was written, then instead of
loading the constants 4 times, at the cost of 3 instructions each, it
could have loaded the constant only once and then generated each following
constant at the cost of a single-word, single-clock shift. The code would
have been rather shorter *and* faster, plus some of the jumps could have
been eliminated. Practically each C statement line (except the braces)
corresponds to one assembly instruction, so without being clever, just
translating what's written, it could be done in 20 words instead of 30.

Is this a problem worth reporting in bugzilla, or do I just have to use
some trickery to stop the compiler from being smarter than it should be?
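
One form such trickery could take, as a sketch only: an empty GCC
extended asm statement that claims to modify tmp, so the optimizer can no
longer treat it as a known constant and has to compute the later values
with shifts. HIDE_VALUE and divby1e9_opaque are made-up names, the
construct is a GCC extension rather than standard C, and what gcc 4.4.0
actually emits for it has not been checked here.

```c
#include <stdint.h>

/* Hypothetical helper: an empty asm that takes x in a register and
   pretends to rewrite it, hiding its value from constant propagation.
   GCC extended asm syntax; no instruction is emitted for it. */
#define HIDE_VALUE(x) __asm__ ("" : "+r" (x))

uint32_t divby1e9_opaque(uint32_t num, uint32_t *quotient)
{
    uint32_t tmp = 1000000000u;
    uint32_t dig = 0;

    HIDE_VALUE(tmp);        /* from here on tmp is "unknown" to GCC */

    if (num >= tmp) {
        tmp <<= 2;          /* must now be computed as a shift */
        if (num >= tmp) {
            num -= tmp;
            dig = 4;
        } else {
            tmp >>= 1;
            if (num >= tmp) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}
```

The results are unchanged, since the asm does nothing at run time; only
the compiler's knowledge of tmp is affected.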

Zoltan

