On Thursday, December 4, 2014 4:17:19 PM UTC-6, Johan Sigfrids wrote:
>
> The new AMD architectures are weird in that two integer cores share the
> same FP hardware, so you have half as many FP units as integer cores.
> The reported number of cores is based on integer cores.
>
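If that is what is happening here, the peakflops() curve should flatten near the FP-module count rather than at CPU_CORES. A minimal sketch of that check, using the same 0.4-era blas_set_num_threads/peakflops calls as the rest of this thread (the thread counts and the median-of-five are arbitrary choices):

    # Sweep the BLAS thread count over powers of two and report the median
    # of five peakflops(8000) runs; look for where the curve stops climbing.
    for n in (1, 2, 4, 8, 16, 32)
        blas_set_num_threads(n)
        flops = sort([peakflops(8000)::Float64 for i in 1:5])
        @printf("%2d threads: %.4e flops (median of 5)\n", n, flops[3])
    end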
Thanks. That helps to explain a lot.

> On Thursday, December 4, 2014 11:13:38 PM UTC+2, Douglas Bates wrote:
>>
>> On Thursday, December 4, 2014 2:32:01 PM UTC-6, Stefan Karpinski wrote:
>>>
>>> Hyperthreading? If the threshold is 16 but you're really only getting 8
>>> cores, you might only get scaling up to 8.
>>>
>>
>> This machine has AMD Opteron processors. I know Intel uses
>> hyperthreading; does AMD also use it?
>>
>> I recompiled OpenBLAS setting NUM_THREADS to 32 but still get the same
>> result - essentially no difference between 8 and 16 threads.
>>
>> julia> blas_set_num_threads(4)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>>  8.66448e10
>>  8.67398e10
>>  8.67465e10
>>  8.68957e10
>>  8.69717e10
>>  8.70661e10
>>
>> julia> blas_set_num_threads(8)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>>  1.67257e11
>>  1.66041e11
>>  1.65284e11
>>  1.65565e11
>>  1.65867e11
>>  1.65596e11
>>
>> julia> blas_set_num_threads(16)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>>  1.65354e11
>>  1.7099e11
>>  1.70911e11
>>  1.71407e11
>>  1.71238e11
>>  1.70983e11
>>
>>> > On Dec 4, 2014, at 3:24 PM, Viral Shah <vi...@mayin.org> wrote:
>>> >
>>> >> On 05-Dec-2014, at 1:32 am, Douglas Bates <dmb...@gmail.com> wrote:
>>> >>
>>> >> On Thursday, December 4, 2014 1:50:06 PM UTC-6, Viral Shah wrote:
>>> >>> On 05-Dec-2014, at 1:16 am, Douglas Bates <dmb...@gmail.com> wrote:
>>> >>>
>>> >>> Thanks, I'll try that. I'm still curious as to why there is so
>>> >>> little difference between 8 and 16 threads.
>>> >>
>>> >> peakflops() just performs a matrix multiplication to estimate the
>>> >> flops. It uses a 2000x2000 matrix by default, which is good for most
>>> >> laptops, but for bigger machines with more cores, one often needs to
>>> >> use a larger matrix to see the speedup.
>>> >>
>>> >> peakflops(8000) should give a good indication. I am not sure what
>>> >> the running time will be, so you may want to gradually increase the
>>> >> size.
>>> >>
>>> >> 8000 is reasonable on this machine and it does stabilize the results
>>> >> from repeated timings. But I still have essentially no difference
>>> >> between 8 and 16 threads. I wonder if somehow NUM_THREADS is being
>>> >> set to 8, although looking in deps/Makefile it does seem that it
>>> >> should be 16.
>>> >
>>> > I tried on julia.mit.edu, and I do see a scale-up from 1->16
>>> > processors with peakflops(4000). That seems to suggest that the build
>>> > is ok, and OpenBLAS can scale. I think it would be best to check with
>>> > Xianyi about this - perhaps file an issue against OpenBLAS?
>>> >
>>> > Perhaps someone here may have some other ideas too.
>>> >
>>> > -viral
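As a point of reference for the explanation above, peakflops(n) is essentially a timed n-by-n Float64 matrix multiply. A rough stand-in, assuming the conventional 2n^3 flop count for an n-by-n gemm (the untimed warm-up keeps thread start-up out of the measurement):

    # Approximate what peakflops(n) measures: time one n-by-n multiply
    # after a warm-up, then divide the nominal flop count by the elapsed time.
    function rough_peakflops(n::Integer=2000)
        a = rand(n, n)
        a*a                  # warm-up, not timed
        t = @elapsed a*a
        2n^3/t               # ~2n^3 floating-point operations in the multiply
    end

At the rates reported in this thread, a 2000x2000 multiply finishes in roughly a tenth of a second, so thread start-up and synchronization can eat into the measurement; that is why a larger n is needed before the extra threads show up.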
>>> >
>>> >> julia> blas_set_num_threads(4)
>>> >>
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6]
>>> >> 6-element Array{Float64,1}:
>>> >>  8.66823e10
>>> >>  8.65584e10
>>> >>  8.65692e10
>>> >>  8.64753e10
>>> >>  8.64083e10
>>> >>  8.63359e10
>>> >>
>>> >> julia> blas_set_num_threads(8)
>>> >>
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6]
>>> >> 6-element Array{Float64,1}:
>>> >>  1.68008e11
>>> >>  1.67772e11
>>> >>  1.67378e11
>>> >>  1.67397e11
>>> >>  1.6746e11
>>> >>  1.67623e11
>>> >>
>>> >> julia> blas_set_num_threads(16)
>>> >>
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6]
>>> >> 6-element Array{Float64,1}:
>>> >>  1.66779e11
>>> >>  1.70068e11
>>> >>  1.698e11
>>> >>  1.70419e11
>>> >>  1.70601e11
>>> >>  1.67226e11
>>> >>
>>> >> -viral
>>> >>
>>> >>> -viral
>>> >>>
>>> >>> On Friday, December 5, 2014 1:00:39 AM UTC+5:30, Douglas Bates wrote:
>>> >>> I have been working on a package
>>> >>> https://github.com/dmbates/ParalllelGLM.jl and noticed some
>>> >>> peculiarities in the timings on a couple of shared-memory servers,
>>> >>> each with 32 cores. In particular, changing from 16 workers to 32
>>> >>> workers actually slowed down the fitting process. So I decided to
>>> >>> check how changing the number of OpenBLAS threads affected the
>>> >>> peakflops() result. I end up with essentially the same results for
>>> >>> 8, 16 and 32 threads on this machine with 32 cores. Is that to be
>>> >>> expected?
>>> >>>
>>> >>>                _
>>> >>>    _       _ _(_)_     |  A fresh approach to technical computing
>>> >>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>> >>>    _ _   _| |_  __ _   |  Type "help()" for help.
>>> >>>   | | | | | | |/ _` |  |
>>> >>>   | | |_| | | | (_| |  |  Version 0.4.0-dev+1944 (2014-12-04 15:06 UTC)
>>> >>>  _/ |\__'_|_|_|\__'_|  |  Commit 87e9ee1* (0 days old master)
>>> >>> |__/                   |  x86_64-unknown-linux-gnu
>>> >>>
>>> >>> julia> [peakflops()::Float64 for i in 1:6]
>>> >>> 6-element Array{Float64,1}:
>>> >>>  1.41151e11
>>> >>>  1.1676e11
>>> >>>  1.27597e11
>>> >>>  1.27607e11
>>> >>>  1.27518e11
>>> >>>  1.27478e11
>>> >>>
>>> >>> julia> CPU_CORES
>>> >>> 32
>>> >>>
>>> >>> julia> blas_set_num_threads(16)
>>> >>>
>>> >>> julia> [peakflops()::Float64 for i in 1:6]
>>> >>> 6-element Array{Float64,1}:
>>> >>>  1.23523e11
>>> >>>  1.27119e11
>>> >>>  1.11381e11
>>> >>>  1.17847e11
>>> >>>  1.28415e11
>>> >>>  1.17998e11
>>> >>>
>>> >>> julia> blas_set_num_threads(8)
>>> >>>
>>> >>> julia> [peakflops()::Float64 for i in 1:6]
>>> >>> 6-element Array{Float64,1}:
>>> >>>  1.25194e11
>>> >>>  1.20969e11
>>> >>>  1.25777e11
>>> >>>  1.20757e11
>>> >>>  1.26086e11
>>> >>>  1.20958e11
>>> >>>
>>> >>> julia> versioninfo(true)
>>> >>> Julia Version 0.4.0-dev+1944
>>> >>> Commit 87e9ee1* (2014-12-04 15:06 UTC)
>>> >>> Platform Info:
>>> >>>   System: Linux (x86_64-unknown-linux-gnu)
>>> >>>   CPU: AMD Opteron(tm) Processor 6328
>>> >>>   WORD_SIZE: 64
>>> >>>   "Red Hat Enterprise Linux Server release 6.5 (Santiago)"
>>> >>>   uname: Linux 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Dec 13 06:58:20 EST 2013 x86_64 x86_64
>>> >>> Memory: 504.78467178344727 GB (508598.8125 MB free)
>>> >>> Uptime: 261586.0 sec
>>> >>> Load Avg: 0.08740234375 0.19384765625 0.8330078125
>>> >>> AMD Opteron(tm) Processor 6328:
>>> >>>        speed        user      nice       sys        idle       irq
>>> >>> #1-32  3199 MHz  1855973 s  23392 s  670932 s  834073187 s   21 s
>>> >>>
>>> >>>   BLAS: libopenblas (USE64BITINT NO_AFFINITY PILEDRIVER)
>>> >>>   LAPACK: libopenblas
>>> >>>   LIBM: libopenlibm
>>> >>>   LLVM: libLLVM-3.5.0
>>> >>> Environment:
>>> >>>   TERM = screen
>>> >>>   PATH = /s/cmake-3.0.2/bin:/s/gcc-4.9.2/bin:./u/b/a/bates/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/s/std/bin:/usr/afsws/bin:
>>> >>>   WWW_HOME = http://www.stat.wisc.edu/
>>> >>>   JULIA_PKGDIR = /scratch/bates/.julia
>>> >>>   HOME = /u/b/a/bates
>>> >>>
>>> >>> Package Directory: /scratch/bates/.julia/v0.4
>>> >>> 2 required packages:
>>> >>>  - Distributions                 0.6.1
>>> >>>  - Docile                        0.3.2
>>> >>> 5 additional packages:
>>> >>>  - ArrayViews                    0.4.8
>>> >>>  - Compat                        0.2.5
>>> >>>  - PDMats                        0.3.1
>>> >>>  - ParallelGLM                   0.0.0-             master (unregistered)
>>> >>>  - StatsBase                     0.6.10
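One way to settle the "is NUM_THREADS silently 8?" question raised above is to ask the loaded OpenBLAS directly. A sketch, assuming an OpenBLAS new enough to export openblas_get_num_threads; the 64_ suffix matches the USE64BITINT build shown in the versioninfo output, and would be dropped for a 32-bit-integer build:

    # Request more threads than the suspected cap, then read back the
    # effective count; OpenBLAS clamps the request to the compile-time
    # NUM_THREADS limit.  The suffixed symbol name is an assumption -
    # check `nm` on your libopenblas if the ccall cannot find it.
    openblas_threads() =
        ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())

    blas_set_num_threads(32)
    openblas_threads()   # returns 8 if the library was built with NUM_THREADS=8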