For my numerics class at MIT <http://math.mit.edu/~stevenj/18.335/>, I used 
the following notebook to talk about cache effects and matrix 
multiplication:

http://nbviewer.ipython.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb

It includes some code to benchmark the built-in BLAS-based multiplication 
against some simpler algorithms, and for comparison purposes I used 
blas_set_num_threads(1) to benchmark only serial performance... or so I 
thought.
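For reference, a minimal sketch of that kind of benchmark (not the actual notebook code; the matrix size n = 1000 and the timing loop are my own choices for illustration). In Julia 0.4 the call was the global `blas_set_num_threads(1)` as above; in current Julia the equivalent lives in `LinearAlgebra.BLAS`:

```julia
# Sketch of a serial-BLAS matmul benchmark.  In Julia 0.4 this was
# blas_set_num_threads(1); in modern Julia it is BLAS.set_num_threads.
using LinearAlgebra
BLAS.set_num_threads(1)      # intended to restrict BLAS to one thread

n = 1000
A = rand(n, n)
B = rand(n, n)
A * B                        # warm-up call (compilation, caches)
t = @elapsed A * B
gflops = 2n^3 / t / 1e9      # an n-by-n matmul performs ~2n^3 flops
println("n = $n: $(round(gflops, digits=2)) Gflops")
```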

When I ran the benchmark on my desktop, the results made sense: OpenBLAS 
got about 3 × 4 = 12 Gflops, which is peak performance for a 3 GHz CPU that 
can perform 4 flops per cycle (via 256-bit AVX instructions).  However, on 
my laptop it got about 40 Gflops, which only makes sense if it was using 
additional cores.  In both cases, this was with Julia 0.4 using OpenBLAS.
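Writing out the arithmetic behind that sanity check (assuming, hypothetically, a similar per-core peak on the laptop as on the desktop):

```julia
clock_ghz = 3.0           # 3 GHz clock, per the post
flops_per_cycle = 4       # 4 flops/cycle via 256-bit AVX, per the post
peak_serial = clock_ghz * flops_per_cycle   # 12 Gflops for one core

measured = 40.0                             # Gflops observed on the laptop
implied_cores = measured / peak_serial      # > 1 means BLAS was not serial
println("serial peak = $peak_serial Gflops; implied cores ≈ ",
        round(implied_cores, digits=1))
```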

Is there any reason why blas_set_num_threads(1) would not be sufficient to 
disable additional cores?

--SGJ
