After profiling some code that I will discuss in a lightning talk at
JuliaCon I found a bottleneck in an unexpected place. (I guess I should
know by now that profiling always shows that the bottleneck is in an
unexpected place - that's why we do it.)
I have a matrix of matrices, some of which are dense and some sparse, for
which I wish to create the lower Cholesky factor. This operation is done
once for each evaluation of the objective function in a nonlinear
optimization, so I want to be careful to avoid creating unnecessary copies.
The chol! fallback methods (those whose argument is an AbstractMatrix) in
base/linalg/cholesky.jl have operations like
A[k,k] -= A[k,i]*A[k,i]'
and
A[i,k] -= A[i,j]*A[k,j]'*AkkInv'
which, when the elements of A are themselves matrices, I rewrote as calls such as
downdate!(A[k,k], A[k,i]) and so on.
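For two dense blocks the downdate is just a symmetric rank-k update of the
lower triangle, so that method can be a thin wrapper around BLAS.syrk!. A
minimal sketch of what it might look like (not necessarily the exact method
in my code) is

function downdate!{T<:FloatingPoint}(C::DenseMatrix{T},A::DenseMatrix{T})
    # C ← C - A*A' on the lower triangle, for the A[k,k] -= A[k,i]*A[k,i]' step.
    # Sketch only; assumes T is a BLAS element type such as Float64.
    size(C,1) == size(C,2) == size(A,1) || throw(DimensionMismatch(""))
    BLAS.syrk!('L','N',-one(T),A,one(T),C)
end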
Another of these methods, taking a dense C and A and a sparse B, is
function downdate!{T<:FloatingPoint}(C::DenseMatrix{T},A::DenseMatrix{T},B::SparseMatrixCSC{T})
    ma,na = size(A)
    mb,nb = size(B)
    ma == size(C,1) && mb == size(C,2) && na == nb ||
        throw(DimensionMismatch("incompatible sizes of C, A and B"))
    rvb = rowvals(B); nzb = nonzeros(B)
    for j in 1:nb
        aj = sub(A,:,j)              # view of the j'th column of A, no copy
        ib = nzrange(B,j)            # indices of the nonzeros in the j'th column of B
        rvbj = sub(rvb,ib)           # row numbers of those nonzeros
        nzbj = sub(nzb,ib)           # values of those nonzeros
        for k in eachindex(ib)
            # C[:,rvbj[k]] -= nzbj[k]*A[:,j], accumulating C - A*B' column by column
            BLAS.axpy!(-nzbj[k],aj,sub(C,:,Int(rvbj[k])))
        end
    end
    C
end
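For concreteness, with made-up sizes this method would be called as

C = rand(4,3); A = rand(4,5); B = sprand(3,5,0.4)
downdate!(C,A,B)    # updates C in place to C - A*B'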
The surprising bottleneck is the call to sub(C,:,Int(rvbj[k])) inside the
call to BLAS.axpy!. Fortunately the BLAS.axpy! method signatures are
flexible enough that I think I can work around this using some pointer
arithmetic and pointer_to_array. At present the sub operation seems to
account for about 2/3 of the time spent in the BLAS.axpy! call. That is,
the linear algebra is much faster than working out which piece of the
matrix to operate on.
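Something along these lines is what I have in mind. This is an untested
sketch, not the final code: it assumes the low-level six-argument
BLAS.axpy! method that accepts raw pointers, and downdate_ptr! is a name I
made up for the illustration.

function downdate_ptr!{T<:FloatingPoint}(C::DenseMatrix{T},A::DenseMatrix{T},B::SparseMatrixCSC{T})
    ma,na = size(A)
    mb,nb = size(B)
    ma == size(C,1) && mb == size(C,2) && na == nb ||
        throw(DimensionMismatch("incompatible sizes of C, A and B"))
    rvb = rowvals(B); nzb = nonzeros(B)
    colbytes = ma*sizeof(T)          # bytes in one column of A, and of C
    pA = pointer(A); pC = pointer(C)
    for j in 1:nb
        paj = pA + (j-1)*colbytes    # pointer to the start of A[:,j]
        for k in nzrange(B,j)
            pck = pC + (rvb[k]-1)*colbytes   # pointer to the start of C[:,rvb[k]]
            BLAS.axpy!(ma,-nzb[k],paj,1,pck,1)
        end
    end
    C
end

The pointers index directly into the column-major storage of A and C, so no
SubArray needs to be constructed in the inner loop.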