On Tuesday, November 11, 2014 at 19:29 -0800, Todd Leo wrote:
> Thanks, now the 2-norm calculation for sparse matrices in cosine()
> can be optimized with your sumsq() function. But can the dot product
> part of cosine() for sparse matrices be optimized as well?
I think so. Since the product of a zero and a nonzero element is, well,
zero, you just need to iterate over the nonzero elements of one matrix
and multiply them with the corresponding elements of the other matrix
(where the latter are zero, you don't care). Not sure how to do this
exactly off the top of my head, but it shouldn't be too hard.
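
Something along these lines might work (an untested sketch, with a
made-up name sparsedot(); it assumes both matrices have the same
shape, like the single-column ones in your tests):

function sparsedot(a::SparseMatrixCSC{Float64, Int64},
                   b::SparseMatrixCSC{Float64, Int64})
    s = 0.0
    # Only the stored (nonzero) entries of a can contribute to the
    # dot product, so everything else is skipped.
    for col in 1:size(a, 2)
        for k in a.colptr[col]:(a.colptr[col+1] - 1)
            s += a.nzval[k] * b[a.rowval[k], col]
        end
    end
    return s
end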


Regards

> On Tuesday, November 11, 2014 6:57:44 PM UTC+8, Milan Bouchet-Valat
> wrote:
>         On Monday, November 10, 2014 at 19:06 -0800, Todd Leo wrote:
>         > I did, actually, try expanding the vectorized operations
>         > into explicit for loops, and computing the vector
>         > multiplication / vector norm through the BLAS interfaces.
>         > The explicit loops did allocate less memory, but they took
>         > much more time. Meanwhile, the vectorized version I've
>         > gotten used to writing runs incredibly fast, as the
>         > following tests indicate:
>         I don't have a typical matrix like yours to test it, but your
>         cosine() function is adapted from an algorithm for dense
>         matrices. If your matrix is really sparse, you can do the
>         computations only for the nonzero entries when multiplying i
>         by itself and j by itself. I've put together a short function
>         to compute the sum of squares here:
>         https://gist.github.com/nalimilan/3ab9922294663b2ec979
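>         
>         The idea is roughly this (a minimal sketch; the gist may
>         differ in the details):
>         
>         function sumsq(a::SparseMatrixCSC{Float64, Int64})
>             s = 0.0
>             # Zero entries contribute nothing to the sum, so only
>             # the stored values need to be visited.
>             for v in a.nzval
>                 s += v^2
>             end
>             return s
>         end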
>         
>         BTW, looks like something like that is needed in Base to
>         specialize sumabs2() for sparse matrices, as the current
>         generic version is terribly slow in that case.
>         
>         It could easily be extended to compute i .* j if the nonzero
>         entries in both matrices match. If not, more work is needed,
>         but you could adapt the .* function from [1] to compute the
>         sum on the fly instead of allocating space for a new matrix.
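>         
>         For single-column matrices that boils down to a merge over
>         the two sorted lists of stored entries, something like this
>         (again only a sketch, with a made-up name):
>         
>         function spdot(a::SparseMatrixCSC{Float64, Int64},
>                        b::SparseMatrixCSC{Float64, Int64})
>             s = 0.0
>             ka, kb = 1, 1
>             # Row indices are stored in sorted order, so walk both
>             # lists in lockstep; only indices present in both
>             # matrices contribute to the sum.
>             while ka <= nnz(a) && kb <= nnz(b)
>                 ra, rb = a.rowval[ka], b.rowval[kb]
>                 if ra == rb
>                     s += a.nzval[ka] * b.nzval[kb]
>                     ka += 1
>                     kb += 1
>                 elseif ra < rb
>                     ka += 1
>                 else
>                     kb += 1
>                 end
>             end
>             return s
>         end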
>         
>         
>         Regards
>         
>         1: https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
>  
>         > # Explicit for loop, slightly modified from
>         > # SimilarityMetrics.jl by johnmyleswhite
>         > # (https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl)
>         > function cosine(a::SparseMatrixCSC{Float64, Int64},
>         >                 b::SparseMatrixCSC{Float64, Int64})
>         >     sA, sB, sI = 0.0, 0.0, 0.0
>         >     for i in 1:length(a)
>         >         sA += a[i]^2
>         >         sI += a[i] * b[i]
>         >     end
>         >     for i in 1:length(b)
>         >         sB += b[i]^2
>         >     end
>         >     return sI / sqrt(sA * sB)
>         > end
>         > 
>         > # BLAS version
>         > function cosine_blas(i::SparseMatrixCSC{Float64, Int64},
>         > j::SparseMatrixCSC{Float64, Int64})
>         >     i = full(i)
>         >     j = full(j)
>         >     numerator = BLAS.dot(i, j)
>         >     denominator = BLAS.nrm2(i) * BLAS.nrm2(j)
>         >     return numerator / denominator
>         > end
>         > 
>         > # The vectorized version remains the same, as the 1st post
>         > # shows.
>         > 
>         > # Test functions
>         > function test_explicit_loop(d)
>         >     for n in 1:10000
>         >         v = d[:,1]
>         >         cosine(v,v)
>         >     end
>         > end
>         > 
>         > function test_blas(d)
>         >     for n in 1:10000
>         >         v = d[:,1]
>         >         cosine_blas(v,v)
>         >     end
>         > end
>         > 
>         > function test_vectorized(d)
>         >     for n in 1:10000
>         >         v = d[:,1]
>         >         cosine_vectorized(v,v)
>         >     end
>         > end
>         > 
>         > # Warm up first, to exclude JIT compilation from the timings
>         > test_explicit_loop(mat)
>         > test_blas(mat)
>         > test_vectorized(mat)
>         > gc()
>         > @time test_explicit_loop(mat)
>         > gc()
>         > @time test_blas(mat)
>         > gc()
>         > @time test_vectorized(mat)
>         > 
>         > # Results (explicit loop, BLAS, vectorized, in that order)
>         > elapsed time: 3.772606858 seconds (6240080 bytes allocated)
>         > elapsed time: 0.400972089 seconds (327520080 bytes allocated, 81.58% gc time)
>         > elapsed time: 0.011236068 seconds (34560080 bytes allocated)
>         > 
>         > 
>         > 
>         > 
>         > On Monday, November 10, 2014 7:23:17 PM UTC+8, Milan
>         > Bouchet-Valat wrote:
>         >         On Sunday, November 9, 2014 at 21:17 -0800, Todd
>         >         Leo wrote:
>         >         > Hi fellows, 
>         >         > 
>         >         > 
>         >         > I'm currently working on sparse matrices and
>         >         > cosine similarity computation, but my routines
>         >         > are running very slowly, at least slower than I
>         >         > expected. So I wrote some test functions to dig
>         >         > out the reason for the inefficiency. To my
>         >         > surprise, the execution times of passing two
>         >         > vectors to the test function and of passing the
>         >         > whole sparse matrix differ greatly: the latter is
>         >         > 80x faster. I am wondering why extracting the two
>         >         > vectors from the matrix inside each loop is so
>         >         > dramatically faster, and how to avoid the
>         >         > multi-GB memory allocation. Thanks guys.
>         >         > 
>         >         > 
>         >         > --
>         >         > BEST REGARDS,
>         >         > Todd Leo
>         >         > 
>         >         > 
>         >         > # The sparse matrix
>         >         > mat # 2000x15037 SparseMatrixCSC{Float64, Int64}
>         >         > 
>         >         > 
>         >         > # The two vectors, prepared in advance
>         >         > v = mat'[:,1]
>         >         > w = mat'[:,2]
>         >         > 
>         >         > 
>         >         > # Cosine similarity function
>         >         > function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64},
>         >         >                            j::SparseMatrixCSC{Float64, Int64})
>         >         >     return sum(i .* j) / sqrt(sum(i .* i) * sum(j .* j))
>         >         > end
>         >         I think you'll experience a dramatic speed gain if
>         >         you write the sums in explicit loops, accessing
>         >         elements one by one, taking their product and adding
>         >         it immediately to a counter. In your current
>         >         version, the element-wise products allocate new
>         >         vectors before computing the sums, which is very
>         >         costly.
>         >         
>         >         This will also get rid of the difference you
>         >         report between passing the whole matrix and
>         >         passing the two vectors.
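>         >         
>         >         For a pair of dense vectors, the fused loop would
>         >         look roughly like this (a sketch; cosine_loop is a
>         >         made-up name):
>         >         
>         >         function cosine_loop(i::Vector{Float64}, j::Vector{Float64})
>         >             sI, sII, sJJ = 0.0, 0.0, 0.0
>         >             # One pass accumulating all three sums, without
>         >             # allocating any intermediate vectors.
>         >             for k in 1:length(i)
>         >                 sI += i[k] * j[k]
>         >                 sII += i[k]^2
>         >                 sJJ += j[k]^2
>         >             end
>         >             return sI / sqrt(sII * sJJ)
>         >         end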
>         >         
>         >         
>         >         Regards
>         >         
>         >         > function test1(d)
>         >         >     res = 0.
>         >         >     for i in 1:10000
>         >         >         res = cosine_vectorized(d[:,1], d[:,2])
>         >         >     end
>         >         > end
>         >         > 
>         >         > 
>         >         > function test2(_v,_w)
>         >         >     res = 0.
>         >         >     for i in 1:10000
>         >         >         res = cosine_vectorized(_v, _w)
>         >         >     end
>         >         > end
>         >         > 
>         >         > 
>         >         > test1(mat)
>         >         > test2(v,w)
>         >         > gc()
>         >         > @time test1(mat)
>         >         > gc()
>         >         > @time test2(v,w)
>         >         > 
>         >         > 
>         >         > # elapsed time: 0.054925372 seconds (59360080 bytes allocated, 59.07% gc time)
>         >         > 
>         >         > # elapsed time: 4.204132608 seconds (3684160080 bytes allocated, 65.51% gc time)
>         >         > 
>         >         
>         >         
>         
>         
