Yes, I should have thought of that; it's a very straightforward idea. Just 
call rowvals() on each vector and take the intersection to find the indices 
that are nonzero in both vectors:

function dot_sparse(v::SparseMatrixCSC{Float64, Int64}, w::SparseMatrixCSC{Float64, Int64})
    # Indices that are stored (structurally nonzero) in both column vectors
    non_0_idx = intersect(rowvals(v), rowvals(w))
    _sum = 0.0
    for i in non_0_idx
        _sum += v[i] * w[i]
    end
    return _sum
end
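
One possible refinement, just as a sketch on my side (dot_sparse_merge is a 
made-up name, and I'm assuming both arguments are single-column, n x 1, sparse 
vectors as above): intersect() allocates an intermediate array and every 
v[i] / w[i] lookup has to search the stored indices again, so the two sorted 
rowvals() vectors could instead be walked in parallel, merge-style, together 
with nonzeros():

function dot_sparse_merge(v::SparseMatrixCSC{Float64, Int64},
                          w::SparseMatrixCSC{Float64, Int64})
    vrows, wrows = rowvals(v), rowvals(w)      # sorted row indices of stored entries
    vvals, wvals = nonzeros(v), nonzeros(w)    # the corresponding stored values
    _sum = 0.0
    i, j = 1, 1
    while i <= length(vrows) && j <= length(wrows)
        if vrows[i] == wrows[j]         # index stored in both vectors
            _sum += vvals[i] * wvals[j]
            i += 1; j += 1
        elseif vrows[i] < wrows[j]      # advance whichever side lags behind
            i += 1
        else
            j += 1
        end
    end
    return _sum
end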




On Wednesday, November 12, 2014 3:41:03 PM UTC+8, Milan Bouchet-Valat wrote:
>
> On Tuesday, November 11, 2014 at 19:29 -0800, Todd Leo wrote: 
> > Thanks, the 2-norm part of cosine() for sparse matrices can now be 
> > optimized with your sumsq() function. But can the dot-product part of 
> > cosine() for sparse matrices be optimized as well? 
> I think so. Since the product of a zero and a nonzero element is, well, 
> zero, you just need to iterate over nonzero elements of one matrix and 
> multiply them with the corresponding element of the other matrix (if the 
> latter is zero you don't care). Not sure how to do this exactly off the 
> top of my head, but shouldn't be too hard. 
>
>
> Regards 
>
> > On Tuesday, November 11, 2014 6:57:44 PM UTC+8, Milan Bouchet-Valat 
> > wrote: 
> >         On Monday, November 10, 2014 at 19:06 -0800, Todd Leo wrote: 
> >         > I did, actually, try expanding the vectorized operations into 
> >         > explicit for loops, and computing the vector multiplication / 
> >         > vector norm through the BLAS interfaces. The explicit loops did 
> >         > allocate less memory, but took much more time. Meanwhile, the 
> >         > vectorized version, which I am used to writing, runs incredibly 
> >         > fast, as the following tests indicate: 
> >         I don't have a typical matrix like yours to test it, but your 
> >         cosine() function is adapted from an algorithm for dense 
> >         matrices. If your matrix is really sparse, you can do the 
> >         computations only for the nonzero entries when multiplying i 
> >         by itself and j by itself. I've put together a short function 
> >         to compute the sum of squares here: 
> >         https://gist.github.com/nalimilan/3ab9922294663b2ec979 
> >         
> >         BTW, looks like something like that is needed in Base to 
> >         specialize sumabs2() for sparse matrices, as the current 
> >         generic version is terribly slow in that case. 
> >         
> >         It could easily be extended to compute i .* j if the nonzero 
> >         entries in both matrices match. If not, more work is needed, 
> >         but you could adapt the .* function from [1] to compute the 
> >         sum on the fly instead of allocating space for a new matrix. 
> >         
> >         
> >         Regards 
> >         
> >         1: https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
> 
> >         > # Explicit for loop, slightly modified from SimilarityMetrics.jl 
> >         > # by johnmyleswhite: 
> >         > # https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl 
> >         > 
> >         > function cosine(a::SparseMatrixCSC{Float64, Int64}, 
> >         >                 b::SparseMatrixCSC{Float64, Int64}) 
> >         >     sA, sB, sI = 0.0, 0.0, 0.0 
> >         >     for i in 1:length(a) 
> >         >         sA += a[i]^2 
> >         >         sI += a[i] * b[i] 
> >         >     end 
> >         >     for i in 1:length(b) 
> >         >         sB += b[i]^2 
> >         >     end 
> >         >     return sI / sqrt(sA * sB) 
> >         > end 
> >         > 
> >         > # BLAS version 
> >         > function cosine_blas(i::SparseMatrixCSC{Float64, Int64}, 
> >         >                      j::SparseMatrixCSC{Float64, Int64}) 
> >         >     i = full(i) 
> >         >     j = full(j) 
> >         >     numerator = BLAS.dot(i, j) 
> >         >     denominator = BLAS.nrm2(i) * BLAS.nrm2(j) 
> >         >     return numerator / denominator 
> >         > end 
> >         > 
> >         > # the vectorized version remains the same, as the 1st post 
> >         > shows. 
> >         > 
> >         > # Test functions 
> >         > function test_explicit_loop(d) 
> >         >     for n in 1:10000 
> >         >         v = d[:,1] 
> >         >         cosine(v,v) 
> >         >     end 
> >         > end 
> >         >   
> >         > function test_blas(d) 
> >         >     for n in 1:10000 
> >         >         v = d[:,1] 
> >         >         cosine_blas(v,v) 
> >         >     end 
> >         > end 
> >         >   
> >         > function test_vectorized(d) 
> >         >     for n in 1:10000 
> >         >         v = d[:,1] 
> >         >         cosine_vectorized(v,v) 
> >         >     end 
> >         > end 
> >         > 
> >         > 
> >         > test_explicit_loop(mat) 
> >         > test_blas(mat) 
> >         > test_vectorized(mat) 
> >         > gc() 
> >         > @time test_explicit_loop(mat) 
> >         > gc() 
> >         > @time test_blas(mat) 
> >         > gc() 
> >         > @time test_vectorized(mat) 
> >         > 
> >         > # Results 
> >         > elapsed time: 3.772606858 seconds (6240080 bytes allocated) 
> >         > elapsed time: 0.400972089 seconds (327520080 bytes 
> >         > allocated, 81.58% gc time) 
> >         > elapsed time: 0.011236068 seconds (34560080 bytes allocated) 
> >         > 
> >         > 
> >         > 
> >         > 
> >         > On Monday, November 10, 2014 7:23:17 PM UTC+8, Milan 
> >         > Bouchet-Valat wrote: 
> >         >         On Sunday, November 9, 2014 at 21:17 -0800, Todd Leo wrote: 
> >         >         > Hi fellows, 
> >         >         > 
> >         >         > 
> >         >         > I'm currently working on sparse matrices and cosine 
> >         >         > similarity computation, but my routines are running 
> >         >         > very slowly, at least slower than I expected. So I 
> >         >         > wrote some test functions to dig out the reason for 
> >         >         > the inefficiency. To my surprise, the execution time 
> >         >         > of passing two vectors to the test function versus 
> >         >         > passing the whole sparse matrix differs greatly; the 
> >         >         > latter is 80x faster. I am wondering why extracting 
> >         >         > the two vectors from the matrix in each loop is so 
> >         >         > dramatically faster, and how to avoid the multi-GB 
> >         >         > memory allocation. Thanks, guys. 
> >         >         > 
> >         >         > 
> >         >         > -- 
> >         >         > BEST REGARDS, 
> >         >         > Todd Leo 
> >         >         > 
> >         >         > 
> >         >         > # The sparse matrix 
> >         >         > mat # 2000x15037 SparseMatrixCSC{Float64, Int64} 
> >         >         > 
> >         >         > 
> >         >         > # The two vectors, prepared in advance 
> >         >         > v = mat'[:,1] 
> >         >         > w = mat'[:,2] 
> >         >         > 
> >         >         > 
> >         >         > # Cosine similarity function 
> >         >         > function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64}, 
> >         >         >                            j::SparseMatrixCSC{Float64, Int64}) 
> >         >         >     return sum(i .* j)/sqrt(sum(i.*i)*sum(j.*j)) 
> >         >         > end 
> >         >         I think you'll experience a dramatic speed gain if 
> >         >         you write the sums in explicit loops, accessing 
> >         >         elements one by one, taking their product and adding 
> >         >         it immediately to a counter. In your current 
> >         >         version, the element-wise products allocate new 
> >         >         vectors before computing the sums, which is very 
> >         >         costly. 
> >         >         
> >         >         This will also get rid of the difference you report 
> >         >         between passing arrays and vectors. 
> >         >         
> >         >         
> >         >         Regards 
> >         >         
> >         >         > function test1(d) 
> >         >         >     res = 0. 
> >         >         >     for i in 1:10000 
> >         >         >         res = cosine_vectorized(d[:,1], d[:,2]) 
> >         >         >     end 
> >         >         > end 
> >         >         > 
> >         >         > 
> >         >         > function test2(_v,_w) 
> >         >         >     res = 0. 
> >         >         >     for i in 1:10000 
> >         >         >         res = cosine_vectorized(_v, _w) 
> >         >         >     end 
> >         >         > end 
> >         >         > 
> >         >         > 
> >         >         > test1(dtm) 
> >         >         > test2(v,w) 
> >         >         > gc() 
> >         >         > @time test1(dtm) 
> >         >         > gc() 
> >         >         > @time test2(v,w) 
> >         >         > 
> >         >         > 
> >         >         > #elapsed time: 0.054925372 seconds (59360080 bytes 
> >         >         > allocated, 59.07% gc time) 
> >         >         > 
> >         >         > #elapsed time: 4.204132608 seconds (3684160080 
> >         >         > bytes allocated, 65.51% gc time) 
> >         >         > 
> >         >         
> >         >         
> >         
> >         
>
>
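
Putting the pieces of this thread together, here is a rough sketch of what the 
whole cosine computation could look like when restricted to stored entries. 
The names sumsq_sparse and cosine_sparse are made up (this is not Milan's 
gist), and the single-column (n x 1) sparse-vector assumption from above 
applies again: the denominator only needs the stored values, as in the 
sumsq()/sumabs2() suggestion quoted above, and the numerator can reuse the 
dot_sparse() from the top of this message:

function sumsq_sparse(v::SparseMatrixCSC{Float64, Int64})
    s = 0.0
    for x in nonzeros(v)    # only stored entries contribute to the sum of squares
        s += x * x
    end
    return s
end

function cosine_sparse(v::SparseMatrixCSC{Float64, Int64},
                       w::SparseMatrixCSC{Float64, Int64})
    return dot_sparse(v, w) / sqrt(sumsq_sparse(v) * sumsq_sparse(w))
end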
