Further to the subject of compiler efficiency, the following is the assembler 
code output, with array bounds checking turned off (-B), for the inner tight 
composite culling loop of FasterEratspeed above (generated with go tool compile 
-B -S FasterEratspeed.go > FasterEratspeed.asm):

        0x0051 00081 (main.go:426)      MOVL    R11, CX
        0x0054 00084 (main.go:426)      SHRL    $5, R11
        0x0058 00088 (main.go:428)      MOVL    (R9)(R11*4), R13
        0x005c 00092 (main.go:430)      MOVL    $1, R14
        0x0062 00098 (main.go:430)      SHLL    CX, R14
        0x0065 00101 (main.go:430)      ORL     R13, R14
        0x0068 00104 (main.go:431)      MOVL    R14, (R9)(R11*4)
        0x006c 00108 (main.go:429)      LEAL    (R12)(CX*1), R11
        0x0070 00112 (main.go:425)      CMPL    R11, R8
        0x0073 00115 (main.go:425)      JCS     $0, 81

At 10 instructions, this is about as tight as it gets, short of using the more 
complex read/modify/write form of the ORL instruction, and that doesn't seem to 
save much if any time given instruction latencies.  Note three things about 
this code: it has eliminated the "k & 31" mask for the shift, apparently 
recognizing that it isn't necessary since a 32-bit shift count can't usefully 
exceed 31 anyway; unlike the simple PrimeSpeed program, it properly uses an 
immediate load of '1'; and it cleverly uses the LEAL instruction to add the 
prime value 'q' in R12 to the unmodified 'k' value in CX, writing the sum back 
to the original location of 'j' in R11, which saves another instruction to move 
the result from CX to R11.

This uses a total of seven registers (CX, R8, R9, R11, R12, R13, and R14), so 
it would be possible to make it work and run just as fast even for x86 code.  
That count includes one register to hold the prime value 'q' in R12 and one to 
hold the loop limit 'lngthb' in R8.  An array bounds check requires another 
register to hold the array upper bound, which won't be hard for x64 code, but 
will slow x86 code, as there won't be enough available registers, unless the 
compiler can be made smart enough for certain loop formats to recognize that 
the check isn't necessary, being inherent in the loop condition itself.  It 
would be pretty impressive if all loops could be compiled as well as this, as 
this will run at very close to C/C++ speed!

When run within the L1 cache on a modern Intel CPU, without memory or execution 
pipeline bottlenecks, this code will run in about 3.5 CPU cycles per loop, or 
about 2.86 instructions per cycle.

I'm looking for the reason that this code is so efficient while the earlier 
simple PrimeSpeed code was so poor; one clue might be that I used native uint's 
and int's for PrimeSpeed, which are 64-bit with the amd64 version of the 
compiler, so that code was using 'Q'-suffix (64-bit) instructions, which the 
compiler perhaps doesn't handle as efficiently as the uint32's and int32's used 
here.

Can the compiler guys contribute anything on how to format tight loop code so 
it compiles as well as this?  If the crucial parts of the compiler and 
libraries were written to compile to code as efficient as this, I don't think 
anyone would have problems with golang's speed.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.