I've been reading about it, but mostly ignored it as not too important for what I do. But...
More recently I've been working on porting Linux gcc object code to CMS, and since I needed a nice checksum routine, I figured I might take a popular open source checksum routine (http://en.wikipedia.org/wiki/Adler-32) and let gcc compile and optimize it. Since the generated assembler source wasn't that obvious to me, I got interested in why. My simplistic implementation was like this (for each byte, so wrapped in a loop):

         IC    R4,0(R6)
         AR    R2,R4
         AR    R3,R2

The optimized gcc code was more like this (for 3 bytes):

         IC    R2,1(R5)
         LHI   R3,0
         AR    R2,R1
         AR    R0,R1
         IC    R3,2(R5)
         LHI   R1,0
         AR    R3,R2
         AR    R0,R2
         IC    R1,3(R5)
         LHI   R2,0
         AR    R1,R3
         AR    R0,R3

As I understand it, the LHI is done earlier in the stream to allow overlap with the other instructions, and the code flips the role of the registers with each pass. I thought it was pretty slick, and my test suggested that it was almost twice as fast as my loop.

Next I unrolled the loop in my own code and did 16 bytes in a single pass: 50 instructions. The smart code is 62 instructions. Guess what: my simple version was 50% faster than the smart code. So just because it looks complicated does not make it faster! Very cute to see mine run at 1.7 instructions per cycle... Could it be that our CPU is more targeted at code written by humans rather than by optimizing compilers?

And you're right: my program needs the checksum over 500 bytes or so, and probably at most once per second. So it really does not matter. But it was the weekend...

Rob
