I set blas_set_num_threads(1) and got the same profile numbers. This is using Apple’s BLAS.
Maybe I’ll try 0.5 and OpenBLAS for comparison.

> On 10 Sep 2016, at 2:34 AM, Andreas Noack <andreasnoackjen...@gmail.com> wrote:
>
> Try to time it again with threading disabled. Sometimes the threading
> heuristics can cause unintuitive performance.
>
> On Friday, September 9, 2016 at 6:39:13 AM UTC-4, Sheehan Olver wrote:
>
> I have the following code that is part of a Householder routine, where
> j::Int64, N::Int64, R.cols::Vector{Int64}, wp::Ptr{Float64}, M::Int64,
> v::Ptr{Float64}:
>
> …
> for j=k:N
>     v=r+(R.cols[j]+k-2)*sz
>     dt=BLAS.dot(M,wp,1,v,1)
>     BLAS.axpy!(M,-2*dt,wp,1,v,1)
> end
> …
>
> For some reason, the BLAS.dot call takes 3x as long as the BLAS.axpy! call.
> Is this expected, or is there something wrong?