On Thursday, December 4, 2014 4:17:19 PM UTC-6, Johan Sigfrids wrote:
>
> The new AMD architectures are weird in that they have two integer cores 
> sharing the same FP hardware, so you have half as many FP cores as 
> integer cores. The reported number of cores is based on integer cores. 
>

Thanks.  That helps to explain a lot.
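
To see where the floating-point scaling should flatten, a quick sweep over 
thread counts makes it visible (a rough sketch; the matrix size and repeat 
count here are arbitrary):

for t in (1, 2, 4, 8, 16, 32)
    blas_set_num_threads(t)                         # cap OpenBLAS at t threads
    best = maximum([peakflops(8000) for i in 1:3])  # best of three runs
    println(t, " threads: ", best/1e9, " GFLOP/s")
end

If the shared-FP explanation is right, the rate should stop improving near 
16 threads rather than 32 on this machine.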
 

>
> On Thursday, December 4, 2014 11:13:38 PM UTC+2, Douglas Bates wrote:
>>
>> On Thursday, December 4, 2014 2:32:01 PM UTC-6, Stefan Karpinski wrote:
>>>
>>> Hyperthreading? If the threshold is 16 but you're really only getting 8 
>>> cores, you might only get scaling up to 8. 
>>>
>>
>> This machine has AMD Opteron processors.  I know Intel uses 
>> hyperthreading; does AMD also use it? 
>>
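>> (On Linux one can check for SMT from within Julia; a rough sketch that 
>> just parses /proc/cpuinfo, where "siblings" is logical threads per 
>> package and "cpu cores" is physical cores per package:) 
>>
>> info  = readall("/proc/cpuinfo")   # Linux-only source of topology info
>> sib   = parse(Int, match(r"siblings\s*:\s*(\d+)", info).captures[1])
>> cores = parse(Int, match(r"cpu cores\s*:\s*(\d+)", info).captures[1])
>> sib > cores && println("SMT/hyperthreading appears to be enabled")
>>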
>> I recompiled OpenBLAS with NUM_THREADS set to 32, but I still get the 
>> same result - essentially no difference between 8 and 16 threads. 
>>
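>> One way to confirm the thread limit the running library actually uses - 
>> a sketch assuming the unsuffixed symbol name; a USE64BITINT build may 
>> append a 64_ suffix to it - is to query OpenBLAS through its C API: 
>>
>> nthr = ccall((:openblas_get_num_threads, Base.libblas_name), Cint, ())
>> println("OpenBLAS threads: ", nthr)   # should report the active limit
>>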
>> julia> blas_set_num_threads(4)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>>  8.66448e10
>>  8.67398e10
>>  8.67465e10
>>  8.68957e10
>>  8.69717e10
>>  8.70661e10
>>
>> julia> blas_set_num_threads(8)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>>  1.67257e11
>>  1.66041e11
>>  1.65284e11
>>  1.65565e11
>>  1.65867e11
>>  1.65596e11
>>
>> julia> blas_set_num_threads(16)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>>  1.65354e11
>>  1.7099e11 
>>  1.70911e11
>>  1.71407e11
>>  1.71238e11
>>  1.70983e11
>>
>>>
>>>
>>> > On Dec 4, 2014, at 3:24 PM, Viral Shah <vi...@mayin.org> wrote: 
>>> > 
>>> > 
>>> >> On 05-Dec-2014, at 1:32 am, Douglas Bates <dmb...@gmail.com> wrote: 
>>> >> 
>>> >> On Thursday, December 4, 2014 1:50:06 PM UTC-6, Viral Shah wrote: 
>>> >>> On 05-Dec-2014, at 1:16 am, Douglas Bates <dmb...@gmail.com> wrote: 
>>> >>> 
>>> >>> Thanks, I'll try that.  I'm still curious as to why there is so 
>>> little difference between 8 and 16 threads. 
>>> >> 
>>> >> peakflops() just performs a matrix multiplication to estimate the 
>>> flops. It uses a 2000x2000 matrix by default, which is good for most 
>>> laptops, but for bigger machines with more cores, one often needs to use a 
>>> larger matrix to see the speedup. 
>>> >> 
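>>> >> Roughly, it times an n-by-n matrix product and divides the nominal 
>>> >> 2n^3 flop count by the elapsed time - something like this sketch (not 
>>> >> necessarily the exact Base implementation): 
>>> >>
>>> >> function flops_estimate(n)    # rough stand-in for peakflops(n)
>>> >>     a = rand(n, n)
>>> >>     a*a                       # warm-up multiply, excludes JIT time
>>> >>     t = @elapsed a*a
>>> >>     2.0*n^3/t                 # flops of an n-by-n matrix product
>>> >> end
>>> >>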
>>> >> peakflops(8000) should give a good indication. I am not sure what the 
>>> running time will be, so you may want to gradually increase the size. 
>>> >> 
>>> >> 
>>> >> 8000 is reasonable on this machine and it does stabilize the results 
>>> from repeated timings.  But I still see essentially no difference between 
>>> 8 and 16 threads.  I wonder if somehow NUM_THREADS is being set to 8, 
>>> although looking in the deps/Makefile it does seem that it should be 16. 
>>> > 
>>> > 
>>> > I tried on julia.mit.edu, and I do see a scale-up from 1->16 
>>> processors with peakflops(4000). That seems to suggest that the build is 
>>> ok and that OpenBLAS can scale. I think it would be best to check with 
>>> Xianyi about this - perhaps file an issue against OpenBLAS? 
>>> > 
>>> > Perhaps someone here may have some other ideas too. 
>>> > 
>>> > -viral 
>>> > 
>>> > 
>>> >> 
>>> >> julia> blas_set_num_threads(4) 
>>> >> 
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6] 
>>> >> 6-element Array{Float64,1}: 
>>> >> 8.66823e10 
>>> >> 8.65584e10 
>>> >> 8.65692e10 
>>> >> 8.64753e10 
>>> >> 8.64083e10 
>>> >> 8.63359e10 
>>> >> 
>>> >> julia> blas_set_num_threads(8) 
>>> >> 
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6] 
>>> >> 6-element Array{Float64,1}: 
>>> >> 1.68008e11 
>>> >> 1.67772e11 
>>> >> 1.67378e11 
>>> >> 1.67397e11 
>>> >> 1.6746e11 
>>> >> 1.67623e11 
>>> >> 
>>> >> julia> blas_set_num_threads(16) 
>>> >> 
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6] 
>>> >> 6-element Array{Float64,1}: 
>>> >> 1.66779e11 
>>> >> 1.70068e11 
>>> >> 1.698e11   
>>> >> 1.70419e11 
>>> >> 1.70601e11 
>>> >> 1.67226e11 
>>> >> 
>>> >> 
>>> >> 
>>> >> -viral 
>>> >> 
>>> >> 
>>> >> 
>>> >>> 
>>> >>> -viral 
>>> >>> 
>>> >>> On Friday, December 5, 2014 1:00:39 AM UTC+5:30, Douglas Bates 
>>> wrote: 
>>> >>> I have been working on a package, 
>>> https://github.com/dmbates/ParallelGLM.jl, and noticed some 
>>> peculiarities in the timings on a couple of shared-memory servers, each 
>>> with 32 cores.  In particular, changing from 16 workers to 32 workers 
>>> actually slowed down the fitting process.  So I decided to check how 
>>> changing the number of OpenBLAS threads affected the peakflops() result.  I 
>>> end up with essentially the same results for 8, 16 and 32 threads on this 
>>> machine with 32 cores.  Is that to be expected? 
>>> >>> 
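>>> >>> (One more thing worth ruling out: with many workers each running 
>>> >>> multithreaded BLAS, workers x BLAS threads can oversubscribe the 32 
>>> >>> cores. A sketch of a common mitigation is to force single-threaded 
>>> >>> BLAS on every worker:) 
>>> >>>
>>> >>> addprocs(32)                         # one worker per core
>>> >>> @everywhere blas_set_num_threads(1)  # avoid workers x BLAS threads
>>> >>>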
>>> >>>    _       _ _(_)_     |  A fresh approach to technical computing 
>>> >>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org 
>>> >>>    _ _   _| |_  __ _   |  Type "help()" for help. 
>>> >>>   | | | | | | |/ _` |  | 
>>> >>>   | | |_| | | | (_| |  |  Version 0.4.0-dev+1944 (2014-12-04 15:06 UTC) 
>>> >>>  _/ |\__'_|_|_|\__'_|  |  Commit 87e9ee1* (0 days old master) 
>>> >>> |__/                   |  x86_64-unknown-linux-gnu 
>>> >>> 
>>> >>> julia> [peakflops()::Float64 for i in 1:6] 
>>> >>> 6-element Array{Float64,1}: 
>>> >>> 1.41151e11 
>>> >>> 1.1676e11 
>>> >>> 1.27597e11 
>>> >>> 1.27607e11 
>>> >>> 1.27518e11 
>>> >>> 1.27478e11 
>>> >>> 
>>> >>> julia> CPU_CORES 
>>> >>> 32 
>>> >>> 
>>> >>> julia> blas_set_num_threads(16) 
>>> >>> 
>>> >>> julia> [peakflops()::Float64 for i in 1:6] 
>>> >>> 6-element Array{Float64,1}: 
>>> >>> 1.23523e11 
>>> >>> 1.27119e11 
>>> >>> 1.11381e11 
>>> >>> 1.17847e11 
>>> >>> 1.28415e11 
>>> >>> 1.17998e11 
>>> >>> 
>>> >>> julia> blas_set_num_threads(8) 
>>> >>> 
>>> >>> julia> [peakflops()::Float64 for i in 1:6] 
>>> >>> 6-element Array{Float64,1}: 
>>> >>> 1.25194e11 
>>> >>> 1.20969e11 
>>> >>> 1.25777e11 
>>> >>> 1.20757e11 
>>> >>> 1.26086e11 
>>> >>> 1.20958e11 
>>> >>> 
>>> >>> julia> versioninfo(true) 
>>> >>> Julia Version 0.4.0-dev+1944 
>>> >>> Commit 87e9ee1* (2014-12-04 15:06 UTC) 
>>> >>> Platform Info: 
>>> >>>  System: Linux (x86_64-unknown-linux-gnu) 
>>> >>>  CPU: AMD Opteron(tm) Processor 6328                 
>>> >>>  WORD_SIZE: 64 
>>> >>>           "Red Hat Enterprise Linux Server release 6.5 (Santiago)" 
>>> >>>  uname: Linux 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Dec 13 06:58:20 
>>> EST 2013 x86_64 x86_64 
>>> >>> Memory: 504.78467178344727 GB (508598.8125 MB free) 
>>> >>> Uptime: 261586.0 sec 
>>> >>> Load Avg:  0.08740234375  0.19384765625  0.8330078125 
>>> >>> AMD Opteron(tm) Processor 6328                 : 
>>> >>>          speed         user         nice          sys         idle          irq 
>>> >>> #1-32  3199 MHz    1855973 s      23392 s     670932 s  834073187 s         21 s 
>>> >>> 
>>> >>>  BLAS: libopenblas (USE64BITINT NO_AFFINITY PILEDRIVER) 
>>> >>>  LAPACK: libopenblas 
>>> >>>  LIBM: libopenlibm 
>>> >>>  LLVM: libLLVM-3.5.0 
>>> >>> Environment: 
>>> >>>  TERM = screen 
>>> >>>  PATH = /s/cmake-3.0.2/bin:/s/gcc-4.9.2/bin:./u/b/a/bates/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/s/std/bin:/usr/afsws/bin: 
>>> >>>  WWW_HOME = http://www.stat.wisc.edu/ 
>>> >>>  JULIA_PKGDIR = /scratch/bates/.julia 
>>> >>>  HOME = /u/b/a/bates 
>>> >>> 
>>> >>> Package Directory: /scratch/bates/.julia/v0.4 
>>> >>> 2 required packages: 
>>> >>> - Distributions                 0.6.1 
>>> >>> - Docile                        0.3.2 
>>> >>> 5 additional packages: 
>>> >>> - ArrayViews                    0.4.8 
>>> >>> - Compat                        0.2.5 
>>> >>> - PDMats                        0.3.1 
>>> >>> - ParallelGLM                   0.0.0-             master (unregistered) 
>>> >>> - StatsBase                     0.6.10 
>>> > 
>>>
>>
