After profiling some code that I will discuss in a lightning talk at 
JuliaCon, I found a bottleneck in an unexpected place.  (I guess I should 
know by now that profiling always shows the bottleneck to be in an 
unexpected place - that's why we do it.)

I have a matrix of matrices, some of which are dense and some sparse, for 
which I wish to create the lower Cholesky factor.  This operation is done 
once for each evaluation of the objective function in a nonlinear 
optimization, so I want to be careful about creating copies.  The chol! 
fallback method (the one whose argument is an AbstractMatrix) in 
base/linalg/cholesky.jl has operations like

                A[k,k] -= A[k,i]*A[k,i]'

and

                        A[i,k] -= A[i,j]*A[k,j]'*AkkInv'

which, when the elements of A are themselves matrices, I rewrote as calls 
like downdate!(A[k,k], A[k,i]).
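
For all-dense blocks such a downdate is just a rank-k update with a negative 
coefficient.  As a rough illustration only (an untested sketch restricted to 
the BLAS floating-point element types, not necessarily the methods I actually 
use), the dense cases could be little more than

# C -= A*A', as in A[k,k] -= A[k,i]*A[k,i]'; only the lower triangle of C
# is updated, which is all the lower Cholesky factor needs
function downdate!{T<:Base.LinAlg.BlasFloat}(C::DenseMatrix{T}, A::DenseMatrix{T})
    BLAS.syrk!('L', 'N', -one(T), A, one(T), C)
end

# C -= A*B', as in the off-diagonal updates
function downdate!{T<:Base.LinAlg.BlasFloat}(C::DenseMatrix{T}, A::DenseMatrix{T},
                                             B::DenseMatrix{T})
    BLAS.gemm!('N', 'T', -one(T), A, B, one(T), C)
end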

One of these methods, the one in which the bottleneck turned up, is

# C -= A*B' where C and A are dense and B is sparse, traversing only the
# stored entries of B: for each nonzero B[r,j], C[:,r] -= B[r,j]*A[:,j]
function downdate!{T<:FloatingPoint}(C::DenseMatrix{T}, A::DenseMatrix{T},
                                     B::SparseMatrixCSC{T})
    ma, na = size(A)
    mb, nb = size(B)
    ma == size(C,1) && mb == size(C,2) && na == nb ||
        throw(DimensionMismatch(""))
    rvb = rowvals(B); nzb = nonzeros(B)
    for j in 1:nb
        aj = sub(A, :, j)          # dense column to be scaled
        ib = nzrange(B, j)         # positions of the nonzeros in B[:,j]
        rvbj = sub(rvb, ib)        # their row indices ...
        nzbj = sub(nzb, ib)        # ... and values
        for k in eachindex(ib)
            BLAS.axpy!(-nzbj[k], aj, sub(C, :, Int(rvbj[k])))
        end
    end
    C
end

The surprising bottleneck is the call to sub(C,:,Int(rvbj[k])) inside the 
call to BLAS.axpy!.  At present the sub operation seems to be taking about 
two thirds of the time attributed to the BLAS.axpy! line.  That is, the 
linear algebra is much faster than figuring out which piece of C to work 
on.  Fortunately the BLAS.axpy! method signatures are flexible enough that 
I think I can write around this using some pointer arithmetic and 
pointer_to_array.
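
Something along these lines is what I have in mind (an untested sketch with 
a throwaway name: it restricts the element type to the BLAS floating-point 
types, assumes C and A are plain Matrix objects with contiguous columns, and 
uses the low-level BLAS.axpy!(n, alpha, px, incx, py, incy) method; wrapping 
the same column pointer with pointer_to_array would serve for the 
three-argument axpy! method instead):

function downdate2!{T<:Base.LinAlg.BlasFloat}(C::Matrix{T}, A::Matrix{T},
                                              B::SparseMatrixCSC{T})
    ma, na = size(A)
    mb, nb = size(B)
    ma == size(C,1) && mb == size(C,2) && na == nb ||
        throw(DimensionMismatch(""))
    rvb = rowvals(B); nzb = nonzeros(B)
    for j in 1:nb
        pa = pointer(A, (j - 1)*ma + 1)                # start of A[:,j]
        for k in nzrange(B, j)
            pc = pointer(C, (Int(rvb[k]) - 1)*ma + 1)  # start of C[:,rvb[k]]
            BLAS.axpy!(ma, -nzb[k], pa, 1, pc, 1)      # C[:,rvb[k]] -= B[rvb[k],j]*A[:,j]
        end
    end
    C
end

The pointer arithmetic relies on the columns of a Matrix being stored 
contiguously, which is what makes the unit strides in the axpy! call valid.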



