I've been reading about it, but mostly ignored it as not too important for what I do. But...
More recently I've been working on porting Linux gcc object code to CMS, and since I needed a nice checksum routine, I figured I might take a popular open source checksum routine (http://en.wikipedia.org/wiki/Adler-32) and let gcc compile and optimize it. Since the generated assembler source wasn't that obvious to me, I got interested in why. My simplistic implementation was like this (for each byte, so wrapped in a loop):

         IC    R4,0(R6)
         AR    R2,R4
         AR    R3,R2

The optimized gcc code was more like this (for 3 bytes):

         IC    R2,1(R5)
         LHI   R3,0
         AR    R2,R1
         AR    R0,R1
         IC    R3,2(R5)
         LHI   R1,0
         AR    R3,R2
         AR    R0,R2
         IC    R1,3(R5)
         LHI   R2,0
         AR    R1,R3
         AR    R0,R3

As I understand it, the LHI is done earlier in the stream to allow overlap with the other instructions, and the code flips the role of the registers with each pass. I thought it was pretty slick, and my test suggested that it was almost twice as fast as my loop.

Next I unrolled the loop in my own code and did 16 bytes in a single pass: 50 instructions. The smart code is 62 instructions. Guess what: my simple version was 50% faster than the smart code. So just because it looks complicated does not make it faster! Very cute to see mine run at 1.7 instructions per cycle... Could it be that our CPU is more targeted at code written by humans rather than by optimizing compilers?

And you're right: my program needs the checksum over 500 bytes or so, and probably at most once per second. So it really does not matter. But it was the weekend...

Rob
