Further to the subject of compiler efficiency, the following is the assembler output, with array bounds checking turned off (-B), for the inner tight composite culling loop of FasterEratspeed above (generated with go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):
    0x0051 00081 (main.go:426) MOVL R11, CX
    0x0054 00084 (main.go:426) SHRL $5, R11
    0x0058 00088 (main.go:428) MOVL (R9)(R11*4), R13
    0x005c 00092 (main.go:430) MOVL $1, R14
    0x0062 00098 (main.go:430) SHLL CX, R14
    0x0065 00101 (main.go:430) ORL R13, R14
    0x0068 00104 (main.go:431) MOVL R14, (R9)(R11*4)
    0x006c 00108 (main.go:429) LEAL (R12)(CX*1), R11
    0x0070 00112 (main.go:425) CMPL R11, R8
    0x0073 00115 (main.go:425) JCS $0, 81

At ten instructions, this is about as tight as it gets, short of using the more complex read/modify/write form of the ORL instruction, which doesn't seem to save much if any time given instruction latencies. Note that this code has eliminated the "k & 31" masking for the shift, apparently recognizing that the shift count can't exceed 31 anyway; that, unlike the simple PrimeSpeed program, it properly uses an immediate load of '1'; and that it cleverly uses the LEAL instruction to add the prime value 'q' in R12 to the unmodified 'k' value in CX, writing the sum to the original location of 'j' in R11, which saves the instruction that would otherwise move the result from CX to R11. The loop uses seven registers in total (CX, R8, R9, R11, R12, R13, and R14), so it should be possible to make it work and run just as fast as x86 (32-bit) code. That count includes one register holding the prime value 'q' (R12) and one holding the loop limit 'lngthb' (R8). An array bounds check requires one more register to hold the array's upper bound, which is no problem for x64 code, but it will slow x86 code, as there won't be enough available registers unless the compiler can be made smart enough, for certain loop forms, to recognize that the check isn't necessary because it's implied by the loop condition. It would be pretty impressive if all loops could be compiled as well as this, as this runs at very close to C/C++ speed!
When run within the L1 cache on a modern Intel CPU, without memory or execution-pipeline bottlenecks, this code runs in about 3.5 CPU clock cycles per loop, or about 2.86 instructions per cycle. I'm looking for the reason this code is so efficient while the earlier simple PrimeSpeed code was so poor; one clue may be that PrimeSpeed used native uint's and int's, which are 64 bits wide on the amd64 version of the compiler, so that code used 'Q'-suffix instructions, which the compiler perhaps doesn't handle as efficiently as the uint32's and int32's used here. Can the compiler guys contribute anything on how to format tight loop code so that it compiles as well as this? If the crucial parts of the compiler and libraries were written to compile to code as efficient as this, I don't think anyone would have problems with golang's speed. -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.