While I am sure you are correct that some numeric code with golang can run as fast as C/C++, it seems that bit-packed tight loops such as used for culling composites in the Sieve of Eratosthenes (or the Sieve of Atkin for that matter) are not as optimized. Of course one of the problems that golang has is compulsory array bounds checks, but it doesn't run as fast as even DotNet or JVM code, which also have those compulsory checks.
To be specific, following is a simple Sieve of Eratosthenes program in golang that runs about twice as slow as the same program in C/C++ and about half again as slow as in C# or Java. The range of 262146 is chosen so that the created bit-packed array is smaller than the size of most CPU L1 caches (16 Kilobytes) so that memory access time isn't a bottleneck for most modern desktop CPU's, and the program is run 1000 times to get a reasonable length of time for more accurate timing The commented out lines were an experiment to use unsafe pointers (as in a C style array program) to try to save time by avoiding the automatic array bounds checks, but the resulting time was about the same: package main import ( "fmt" "math" "time" // "unsafe" ) func mkCLUT() [65536]byte { var arr [65536]byte for i := 0; i < 65536; i++ { var cnt byte = 0 for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 { cnt++ } arr[i] = cnt } return arr } var cnstCLUT [65536]byte = mkCLUT() func primesTest(top uint) int { lmtndx := (top - 3) >> 1 lstw := lmtndx >> 5 topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1 cmpsts := make([]uint32, lstw+1) // start := uintptr(unsafe.Pointer(&cmpsts[0])) // step := unsafe.Sizeof(buf[0]) // limit := start + uintptr(len(cmpsts))*step for i := 0; i <= topsqrtndx; i++ { if cmpsts[i>>5]&(uint32(1)<<uint(i)) == 0 { p := (uint(i) << 1) + 3 for j := (p*p - 3) >> 1; j <= lmtndx; j += p { cmpsts[j>>5] |= 1 << (j & 31) } // p := uintptr((uint(i) << 1) + 3) // lmt := uintptr(lmtndx) // for j := (p*p - 3) >> 1; j <= lmt; j += p { // *(*uint)(unsafe.Pointer(start + step*(j>>5))) |= 1 << (j & 31) // } } } msk := uint32(0xFFFFFFFE) << (lmtndx & 31) cmpsts[lstw] |= msk cnt := 1 for i := uint(0); i <= lstw; i++ { v := cmpsts[i] cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)]) } return cnt } func main() { n := uint(262146) strt := time.Now() sum := 0 for i := 0; i < 1000; i++ { sum += primesTest(n) } end := time.Now() fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt), ".") } The assembly listing revealed by "go tool compile -S primestest.go > primestest.s" reveals why (I have added comments at the ends of the lines to indicate what is going on): line 74 for j := (p*p - 3) >> 1; j <= lmtndx; j += p { line 75 cmpsts[j>>5] |= 1 << (j & 31) line 76 } has the following assembly code: 0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j' 0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address 0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check 0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails 0x012e 00302 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; get address of right element of array; the di register contains the pointer to the array base 0x0132 00306 (primestest.go:75) MOVL (BX), R11 ;; get the contents of that element 0x0135 00309 (primestest.go:75) CMPQ R9, SI ;; DO ANOTHER ARRAY BOUNDS CHECK!!!!! 0x0138 00312 (primestest.go:75) JCC $1, 539 ;; OUT TO A DIFFERENT PANIC IF IT FAILS!!! 0x013e 00318 (primestest.go:75) LEAQ (DI)(R9*4), BX ;; GET THE ADDRESS OF THE SAME ELEMENT AGAIN!!!! 0x0142 00322 (primestest.go:75) MOVQ CX, R8 ;;SAVE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!! 0x0145 00325 (primestest.go:75) MOVQ AX, CX 0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31 0x014c 00332 (primestest.go:75) MOVL $1, BP 0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31) 0x0153 00339 (primestest.go:75) MOVQ R8, CX ;;RESTORE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!! 0x0156 00342 (primestest.go:75) ORL R11, BP ;; r11 now contains cmpsts['j' >> 5] | (1 << ('j' & 31)) 0x0159 00345 (primestest.go:75) MOVL BP, (BX) ;; move it back to the element via the address in BX 0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words 0x015b 00347 (primestest.go:75) NOP 0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add 'j' to the stored 'p' in r10 0x015e 00350 (primestest.go:74) NOP 0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12 0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done The main problem is the lines I have commented in capitals where the bounds check is done twice but also where the contents of the cx register are saved and restored inside the loop A secondary problem as far as speed is concerned is that the C/C++ code also takes advantage of the direct read/modify/write instruction, equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)" so the code in this syntax for C/C++ code would become: 0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make a copy of 'j' 0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn it into the uint32 word address 0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do the array bounds check ONCE IF ONE MUST 0x0128 00296 (primestest.go:75) JCC $1, 546 ;; out to panic if it fails 0x0145 00325 (primestest.go:75) MOVQ AX, CX 0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx register now contains 'j' & 31 0x014c 00332 (primestest.go:75) MOVL $1, BP 0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp register now contains 1 << ('j' & 31) 0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work out to even words AS NECESSARY 0x015b 00347 (primestest.go:75) NOP 0x0156 00342 (primestest.go:75) ORL BP, (DI)(R9*4);; the element now contains cmpsts['j' >> 5] | (1 << ('j' & 31)) 0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add 'j' to the stored 'p' in r10 0x015e 00350 (primestest.go:74) NOP 0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; check if 'j' is up to the stored 'lmtndx' in r12 0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and loop back to top if not done Note that NOP's do not take execution time as they are optimized away in a modern CPU and are there to make destinations of jump instructions work to even word boundaries, which can make a difference in speed. The immediately above loop will take about 3.5 clock cycles without bounds check and about 5.5 clock cycles with bounds check for a modern desktop CPU whereas the former version will take about three clock cycles more. This analysis is for 64-bit code, the difference is even worse for 32-bit code as there are less registers available and the golang compiler is even worse at making best use of them. For most use cases, the C/C++ code will be about twice the speed of the golang code. C#/Java do the bounds checks, but not twice and are better at reducing register operations within tight loops. In summary, the golang compiler has a way to go for maximum efficiency in register use, even if the array bounds checks are kept. -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.