I'm writing a simple benchmark to observe the speedup as the number of cores
increases. I'm having an issue where, when I pass several values to -cpu, the
first measurement for a core count of 1 still runs with the default GOMAXPROCS
(4 in my case, with hyper-threading turned off).

I'm thinking it's just a flag that isn't set right, but I'm putting forward
everything I know, because this still seems pretty weird as default
behaviour.

Below are *1.* version and environment info, *2.* the benchmark source code,
*3.* benchmark results with the commands used, and *4.* CPU info.

*1.* version and environment info:
$ go version
go version go1.12.1 linux/amd64

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/point/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/point/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/go"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build577448577=/tmp/go-build -gno-record-gcc-switches"


*2.* benchmark source code
scale_test.go
package scale

import "testing"

func BenchmarkScale(b *testing.B) {
    for i := 0; i < b.N; i++ {
        scale(b)
    }
}

scale.go
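package scale

import (
    "runtime"
    "testing"
)

// scale splits a fixed amount of work (max counters, each incremented
// l times) across as many goroutines as runtime.GOMAXPROCS(0) reports.
func scale(b *testing.B) {

    const max int = 12
    l := 1 << 30 // 2^30 increments per counter
    NP := runtime.GOMAXPROCS(0)

    var a [max]int
    slice := max / NP // counters per worker (integer division)

    done := make(chan bool)
    // Worker P increments the counters in [P*slice, (P+1)*slice).
    worker := func(P int, prod chan bool) {
        for i := P * slice; i < (P+1)*slice; i++ {
            for j := 0; j < l; j++ {
                a[i]++
            }
        }
        prod <- true
    }

    // Note: this runs on every call to scale, i.e. once per b.N iteration;
    // it only discards setup time as intended because b.N == 1 in every
    // run shown below.
    b.ResetTimer()

    for p := 0; p < NP; p++ {
        go worker(p, done)
    }

    // Wait for all NP workers to finish.
    for p := 0; p < NP; p++ {
        <-done
    }

}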
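To make the work split in scale() concrete: worker P handles the counters in [P*slice, (P+1)*slice) with slice = 12/NP, so when NP divides 12 the total work is always 12 × 2^30 increments. Because slice uses integer division, an NP that does not divide 12 (5, for example) would leave the last counters unprocessed; for the -cpu values 1 to 4 used below every counter is covered. The sketch reproduces that arithmetic. bounds, balancedBounds and covered are hypothetical helpers, and balancedBounds is an alternative split that is not used in the code above.

```go
package main

import "fmt"

// bounds mirrors the split used by scale(): slice := total / np,
// and worker p handles the half-open range [p*slice, (p+1)*slice).
func bounds(p, np, total int) (lo, hi int) {
	slice := total / np
	return p * slice, (p + 1) * slice
}

// balancedBounds is a hypothetical alternative (not from the post) that
// spreads the remainder, so every index in [0, total) is assigned.
func balancedBounds(p, np, total int) (lo, hi int) {
	return p * total / np, (p + 1) * total / np
}

// covered counts how many of the total counters are assigned to some
// worker under the given split function.
func covered(split func(p, np, total int) (int, int), np, total int) int {
	n := 0
	for p := 0; p < np; p++ {
		lo, hi := split(p, np, total)
		n += hi - lo
	}
	return n
}

func main() {
	const total = 12
	for np := 1; np <= 6; np++ {
		fmt.Printf("np=%d  original covers %2d/%d  balanced covers %2d/%d\n",
			np, covered(bounds, np, total), total,
			covered(balancedBounds, np, total), total)
	}
}
```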

*3.* benchmark results with the commands used

Ran with 1 core several times; the results are consistent and the time is as expected:
$ go test -bench=. -cpu 1 -trace trace.out -count 5
goos: linux
goarch: amd64
BenchmarkScale            1    20830211783 ns/op
BenchmarkScale            1    20846636262 ns/op
BenchmarkScale            1    20913565823 ns/op
BenchmarkScale            1    20811254293 ns/op
BenchmarkScale            1    20819699212 ns/op
PASS
ok      _/Demo    104.225s

Similarly for 2 cores, with reasonable timings:
$ go test -bench=. -cpu 2 -trace trace.out -count 5
goos: linux
goarch: amd64
BenchmarkScale-2              1    12717158624 ns/op
BenchmarkScale-2              1    12811024997 ns/op
BenchmarkScale-2              1    12774018625 ns/op
BenchmarkScale-2              1    12770698880 ns/op
BenchmarkScale-2              1    12757689969 ns/op
PASS
ok      _/Demo    63.835s

Similarly for 3 cores, with reasonable timings:
$ go test -bench=. -cpu 3 -trace trace.out -count 5
goos: linux
goarch: amd64
BenchmarkScale-3              1    10820517159 ns/op
BenchmarkScale-3              1    10761484928 ns/op
BenchmarkScale-3              1    10730973532 ns/op
BenchmarkScale-3              1    10834395953 ns/op
BenchmarkScale-3              1    11004999207 ns/op
PASS
ok      _/Demo    54.158s

Similarly for 4 cores, with reasonable timings:
$ go test -bench=. -cpu 4 -trace trace.out -count 5
goos: linux
goarch: amd64
BenchmarkScale-4              1    8597505567 ns/op
BenchmarkScale-4              1    8523662979 ns/op
BenchmarkScale-4              1    8578160668 ns/op
BenchmarkScale-4              1    8585568729 ns/op
BenchmarkScale-4              1    8526519203 ns/op
PASS
ok      _/Demo    42.817s
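The speedup implied by these four separate runs can be worked out directly from the reported numbers. A small sketch (the helper names are illustrative) that takes the mean of the five ns/op readings per core count and computes speedup T(1)/T(p) and parallel efficiency speedup/p:

```go
package main

import "fmt"

// ns/op readings copied from the separate -cpu 1, 2, 3 and 4 runs above
// (five readings each, from -count 5).
var readings = map[int][]float64{
	1: {20830211783, 20846636262, 20913565823, 20811254293, 20819699212},
	2: {12717158624, 12811024997, 12774018625, 12770698880, 12757689969},
	3: {10820517159, 10761484928, 10730973532, 10834395953, 11004999207},
	4: {8597505567, 8523662979, 8578160668, 8585568729, 8526519203},
}

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// speedup is T(1)/T(p) using mean ns/op; efficiency is speedup/p.
func speedup(p int) float64    { return mean(readings[1]) / mean(readings[p]) }
func efficiency(p int) float64 { return speedup(p) / float64(p) }

func main() {
	for p := 1; p <= 4; p++ {
		fmt.Printf("cores=%d  mean=%.2fs  speedup=%.2f  efficiency=%.2f\n",
			p, mean(readings[p])/1e9, speedup(p), efficiency(p))
	}
}
```

With these readings the speedup comes out at roughly 1.6, 1.9 and 2.4 for 2, 3 and 4 cores.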

Running all of them in one invocation, the first 1-core reading is similar to
the 4-core readings and inconsistent with the other 1-core readings:
$ go test -bench=. -cpu 1,2,3,4 -trace trace.out -count 5
goos: linux
goarch: amd64
BenchmarkScale                1    8164198465 ns/op
BenchmarkScale                1    20826667395 ns/op
BenchmarkScale                1    20804635058 ns/op
BenchmarkScale                1    20794385469 ns/op
BenchmarkScale                1    20816750891 ns/op
BenchmarkScale-2              1    12706414818 ns/op
BenchmarkScale-2              1    12708898913 ns/op
BenchmarkScale-2              1    12770280120 ns/op
BenchmarkScale-2              1    12775216602 ns/op
BenchmarkScale-2              1    12636852361 ns/op
BenchmarkScale-3              1    10756340731 ns/op
BenchmarkScale-3              1    11045028650 ns/op
BenchmarkScale-3              1    11089450739 ns/op
BenchmarkScale-3              1    10771597158 ns/op
BenchmarkScale-3              1    10991225203 ns/op
BenchmarkScale-4              1    8142770549 ns/op
BenchmarkScale-4              1    8448329795 ns/op
BenchmarkScale-4              1    8436012552 ns/op
BenchmarkScale-4              1    8011377874 ns/op
BenchmarkScale-4              1    8069116380 ns/op
PASS
ok      _/Demo    250.779s

The problem shows up even more clearly with a single run of each:
$ go test -bench=. -cpu 1,2,3,4 -trace trace.out -count 1
goos: linux
goarch: amd64
BenchmarkScale                1    8493942504 ns/op
BenchmarkScale-2              1    12755406326 ns/op
BenchmarkScale-3              1    11021525399 ns/op
BenchmarkScale-4              1    8535553001 ns/op
PASS
ok      _/Demo    40.812s

Tried running with -parallel 1; still unreasonable:
$ go test -bench=. -cpu 1,2,3,4 -trace trace.out -count 1 -parallel 1
goos: linux
goarch: amd64
BenchmarkScale                1    9133347693 ns/op
BenchmarkScale-2              1    12719535088 ns/op
BenchmarkScale-3              1    11691925150 ns/op
BenchmarkScale-4              1    8255112076 ns/op
PASS
ok      _/Demo    41.806s

Also tried adding -p 1; still unreasonable:
$ go test -bench=. -cpu 1,2,3,4 -trace trace.out -count 1 -parallel 1 -p 1
goos: linux
goarch: amd64
BenchmarkScale                1    8773369668 ns/op
BenchmarkScale-2              1    12913794360 ns/op
BenchmarkScale-3              1    11678645565 ns/op
BenchmarkScale-4              1    8864651538 ns/op
PASS
ok      _/Demo    42.236s

*4.* cpu info

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:            3
CPU MHz:             2700.492
CPU max MHz:         3500.0000
CPU min MHz:         800.0000
BogoMIPS:            5186.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 
monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 
x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm 
abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp 
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 
smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt 
xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window 
hwp_epp flush_l1d


-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.