For my numerics class at MIT <http://math.mit.edu/~stevenj/18.335/>, I used the following notebook to talk about cache effects and matrix multiplication:
http://nbviewer.ipython.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb

It includes some code to benchmark the built-in BLAS-based multiplication against some simpler algorithms, and for comparison purposes I used blas_set_num_threads(1) to benchmark only serial performance... or so I thought.

When I ran the benchmark on my desktop, the results made sense: OpenBLAS got about 3 × 4 = 12 Gflops, which is peak performance for a 3 GHz CPU that can perform 4 flops per cycle (via 256-bit AVX instructions). However, on my laptop it got about 40 Gflops, which only makes sense if it was using additional cores. In both cases this was with Julia 0.4 using OpenBLAS.

Is there any reason why blas_set_num_threads(1) would not be sufficient to disable additional cores?

--SGJ
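For readers who want to reproduce the effect, here is a minimal sketch of this kind of benchmark (using the Julia 0.4-era `blas_set_num_threads` name; the matrix size `n = 2000` is an illustrative choice, not taken from the notebook):

```julia
# Restrict OpenBLAS to one thread -- the question is whether this
# actually limits the subsequent A*B to a single core.
blas_set_num_threads(1)

n = 2000
A = rand(n, n)
B = rand(n, n)
A * B                      # warm-up call (JIT compilation, caches)

t = @elapsed A * B
gflops = 2n^3 / t / 1e9    # dense matmul performs ~2n^3 flops
println("achieved ≈ $gflops Gflops")
```

If the printed figure exceeds the single-core peak (clock rate × flops per cycle), the BLAS call is presumably still using multiple threads.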
