Hey!
The benchmark you posted, Cayetano, is:
julia -e 'using Pkg; Pkg.add("BenchmarkTools"); using BenchmarkTools; N =
1000; A = rand(N, N); B = rand(N, N); @btime $A*$B'
This is a matrix multiplication that gets delegated to the underlying
BLAS, right?  Running it under ‘perf record’ confirms it:
--8<---------------cut here---------------start------------->8---
Samples: 139K of event 'cycles:u', Event count (approx.): 99624880590
Overhead Command Shared Object Symbol
35.27% .julia-real libblas.so.3.9.0 [.] dgemm_
3.99% .julia-real libjulia-internal.so.1.8 [.] gc_mark_loop
2.60% .julia-real libjulia-internal.so.1.8 [.] apply_cl
1.06% .julia-real libjulia-internal.so.1.8 [.] jl_get_binding_
--8<---------------cut here---------------end--------------->8---
We’re using libblas.so (presumably the reference BLAS from the ‘lapack’
package) rather than OpenBLAS, so no wonder it’s slow.
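To make the difference concrete, here is a pure-Python sketch (not part
of the original report) of the operation that the ‘dgemm_’ symbol in the
profile performs, namely C ← alpha·A·B + beta·C.  A naive loop like this
is what the reference BLAS roughly amounts to; OpenBLAS implements the
same contract with blocking, SIMD, and threading, which is where the
speed difference comes from:

```python
def dgemm(alpha, A, B, beta, C):
    """Naive reference for BLAS dgemm: C <- alpha*A@B + beta*C.

    A is n x k, B is k x m, C is n x m (lists of lists of floats).
    Optimized BLAS libraries compute exactly this, only much faster.
    """
    n, k = len(A), len(B)
    m = len(B[0])
    for i in range(n):
        for j in range(m):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
dgemm(1.0, A, B, 0.0, C)
# C is now [[19.0, 22.0], [43.0, 50.0]]
```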
Could it be that these flags:
"LIBBLAS=-lopenblas"
"LIBBLASNAME=libopenblas"
are ineffective?  I think we have a lead!
Ludo’.