[go-nuts] Re: [ANN] primegen.go: Sieve of Atkin prime number generator

gordonbgood Wed, 15 Jun 2016 05:37:59 -0700

While I am sure you are correct that some numeric code with golang can run as 
fast as C/C++, it seems that bit-packed tight loops such as used for culling 
composites in the Sieve of Eratosthenes (or the Sieve of Atkin for that matter) 
are not as optimized.  Of course one of the problems that golang has is 
compulsory array bounds checks, but it doesn't run as fast as even DotNet or 
JVM code, which also have those compulsory checks.


To be specific, following is a simple Sieve of Eratosthenes program in golang 
that runs about twice as slow as the same program in C/C++ and about half again 
as slow as in C# or Java.  The range of 262146 is chosen so that the created 
bit-packed array is smaller than the size of most CPU L1 caches (16 Kilobytes) 
so that memory access time isn't a bottleneck for most modern desktop CPU's, 
and the program is run 1000 times to get a reasonable length of time for more 
accurate timing  The commented out lines were an experiment to use unsafe 
pointers (as in a C style array program) to try to save time by avoiding the 
automatic array bounds checks, but the resulting time was about the same:

package main

import (
        "fmt"
        "math"
        "time"
        //    "unsafe"
)

func mkCLUT() [65536]byte {
        var arr [65536]byte
        for i := 0; i < 65536; i++ {
                var cnt byte = 0
                for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 {
                        cnt++
                }
                arr[i] = cnt
        }
        return arr
}

var cnstCLUT [65536]byte = mkCLUT()

func primesTest(top uint) int {
        lmtndx := (top - 3) >> 1
        lstw := lmtndx >> 5
        topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1
        cmpsts := make([]uint32, lstw+1)
//        start := uintptr(unsafe.Pointer(&cmpsts[0]))
//        step := unsafe.Sizeof(buf[0])
//        limit := start + uintptr(len(cmpsts))*step
        for i := 0; i <= topsqrtndx; i++ {
                if cmpsts[i>>5]&(uint32(1)<<uint(i)) == 0 {
                        p := (uint(i) << 1) + 3
                        for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
                                cmpsts[j>>5] |= 1 << (j & 31)
                        }
//                        p := uintptr((uint(i) << 1) + 3)
//                       lmt := uintptr(lmtndx)
//                       for j := (p*p - 3) >> 1; j <= lmt; j += p {
//                         *(*uint)(unsafe.Pointer(start + step*(j>>5))) |= 1 
<< (j & 31)
//                       }
                }
        }
        msk := uint32(0xFFFFFFFE) << (lmtndx & 31)
        cmpsts[lstw] |= msk
        cnt := 1
        for i := uint(0); i <= lstw; i++ {
                v := cmpsts[i]
                cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)])
        }
        return cnt
}

func main() {
        n := uint(262146)

        strt := time.Now()

        sum := 0
        for i := 0; i < 1000; i++ {
                sum += primesTest(n)
        }

        end := time.Now()

        fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt), ".")
}

The assembly listing revealed by "go tool compile -S primestest.go > 
primestest.s" reveals why (I have added comments at the ends of the lines to 
indicate what is going on):

line 74            for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
line 75                cmpsts[j>>5] |= 1 << (j & 31)
line 76            }

has the following assembly code:

        0x011e 00286 (primestest.go:75) MOVQ    AX, R9 ;; make a copy of 'j'
        0x0121 00289 (primestest.go:75) SHRQ    $5, R9  ;; turn it into the 
uint32 word address
        0x0125 00293 (primestest.go:75) CMPQ    R9, SI  ;; do the array bounds 
check
        0x0128 00296 (primestest.go:75) JCC     $1, 546        ;; out to panic 
if it fails
        0x012e 00302 (primestest.go:75) LEAQ    (DI)(R9*4), BX ;; get address 
of right element of array; the di register contains the pointer to the array 
base
        0x0132 00306 (primestest.go:75) MOVL    (BX), R11         ;; get the 
contents of that element
        0x0135 00309 (primestest.go:75) CMPQ    R9, SI  ;; DO ANOTHER ARRAY 
BOUNDS CHECK!!!!!
        0x0138 00312 (primestest.go:75) JCC     $1, 539        ;; OUT TO A 
DIFFERENT PANIC IF IT FAILS!!!
        0x013e 00318 (primestest.go:75) LEAQ    (DI)(R9*4), BX ;; GET THE 
ADDRESS OF THE SAME ELEMENT AGAIN!!!!
        0x0142 00322 (primestest.go:75) MOVQ    CX, R8 ;;SAVE THE CONTENTS OF 
THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
        0x0145 00325 (primestest.go:75) MOVQ    AX, CX
        0x0148 00328 (primestest.go:75) ANDQ    $31, CX ;; cx register now 
contains 'j' & 31
        0x014c 00332 (primestest.go:75) MOVL    $1, BP
        0x0151 00337 (primestest.go:75) SHLL    CX, BP ;; bp register now 
contains 1 << ('j' & 31)
        0x0153 00339 (primestest.go:75) MOVQ    R8, CX  ;;RESTORE THE CONTENTS 
OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!!
        0x0156 00342 (primestest.go:75) ORL     R11, BP  ;; r11 now contains 
cmpsts['j' >> 5] | (1 << ('j' & 31))
        0x0159 00345 (primestest.go:75) MOVL    BP, (BX)  ;; move it back to 
the element via the address in BX
        0x015b 00347 (primestest.go:75) NOP  ;; nop's to make the addresses 
work out to even words
        0x015b 00347 (primestest.go:75) NOP
        0x015b 00347 (primestest.go:74) ADDQ    R10, AX ;; add 'j' to the 
stored 'p' in r10
        0x015e 00350 (primestest.go:74) NOP
        0x015e 00350 (primestest.go:74) CMPQ    AX, R12  ;; check if 'j' is up 
to the stored 'lmtndx' in r12
        0x0161 00353 (primestest.go:74) JLS     $0, 286 ;; and loop back to top 
if not done

The main problem is the lines I have commented in capitals where the bounds 
check is done twice but also
where the contents of the cx register are saved and restored inside the loop

A secondary problem as far as speed is concerned is that the C/C++ code also 
takes advantage of the direct read/modify/write instruction,
equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)" so the code in this syntax for 
C/C++ code would become:

        0x011e 00286 (primestest.go:75) MOVQ    AX, R9 ;; make a copy of 'j'
        0x0121 00289 (primestest.go:75) SHRQ    $5, R9  ;; turn it into the 
uint32 word address
        0x0125 00293 (primestest.go:75) CMPQ    R9, SI  ;; do the array bounds 
check ONCE IF ONE MUST
        0x0128 00296 (primestest.go:75) JCC     $1, 546        ;; out to panic 
if it fails
        0x0145 00325 (primestest.go:75) MOVQ    AX, CX
        0x0148 00328 (primestest.go:75) ANDQ    $31, CX ;; cx register now 
contains 'j' & 31
        0x014c 00332 (primestest.go:75) MOVL    $1, BP
        0x0151 00337 (primestest.go:75) SHLL    CX, BP ;; bp register now 
contains 1 << ('j' & 31)
        0x015b 00347 (primestest.go:75) NOP ;; nop's to make the addresses work 
out to even words AS NECESSARY
        0x015b 00347 (primestest.go:75) NOP
        0x0156 00342 (primestest.go:75) ORL     BP, (DI)(R9*4);; the element 
now contains cmpsts['j' >> 5] | (1 << ('j' & 31))
        0x015b 00347 (primestest.go:74) ADDQ    R10, AX ;; add 'j' to the 
stored 'p' in r10
        0x015e 00350 (primestest.go:74) NOP
        0x015e 00350 (primestest.go:74) CMPQ    AX, R12  ;; check if 'j' is up 
to the stored 'lmtndx' in r12
        0x0161 00353 (primestest.go:74) JLS     $0, 286 ;; and loop back to top 
if not done

Note that NOP's do not take execution time as they are optimized away in a 
modern CPU and are there to
make destinations of jump instructions work to even word boundaries, which can 
make a difference in speed.

The immediately above loop will take about 3.5 clock cycles without bounds 
check and about 5.5 clock cycles with bounds check for a modern desktop CPU
whereas the former version will take about three clock cycles more.

This analysis is for 64-bit code, the difference is even worse for 32-bit code 
as there are less registers available and
the golang compiler is even worse at making best use of them.

For most use cases, the C/C++ code will be about twice the speed of the golang 
code.

C#/Java do the bounds checks, but not twice and are better at reducing register 
operations within tight loops.

In summary, the golang compiler has a way to go for maximum efficiency in 
register use, even if the array bounds checks are kept.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[go-nuts] Re: [ANN] primegen.go: Sieve of Atkin prime number generator

Reply via email to