Thanks! Now the 2-norm calculation in cosine() for sparse matrices can be optimized with your sumsq() function. But is the dot product part of cosine() similarly optimizable for sparse matrices?
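Here is a rough sketch of the kind of thing I have in mind, merging the sorted rowval arrays of the two arguments (sparse_dot is just a placeholder name, and I assume single-column SparseMatrixCSC arguments as in cosine_vectorized; untested):

# Sketch of a nonzeros-only dot product for two single-column sparse
# matrices. It merges the two sorted rowval arrays, so it touches only
# stored entries -- O(nnz) time and no temporary vectors allocated.
function sparse_dot(a::SparseMatrixCSC{Float64, Int64},
                    b::SparseMatrixCSC{Float64, Int64})
    ra, va = a.rowval, a.nzval
    rb, vb = b.rowval, b.nzval
    ia, ib, s = 1, 1, 0.0
    while ia <= length(ra) && ib <= length(rb)
        if ra[ia] == rb[ib]        # row stored in both: contributes
            s += va[ia] * vb[ib]
            ia += 1; ib += 1
        elseif ra[ia] < rb[ib]     # row stored only in a: product is 0
            ia += 1
        else                       # row stored only in b: product is 0
            ib += 1
        end
    end
    return s
end

Does that look like the right approach, or is there something in Base I should use instead?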
On Tuesday, November 11, 2014 6:57:44 PM UTC+8, Milan Bouchet-Valat wrote:
> On Monday, November 10, 2014 at 19:06 -0800, Todd Leo wrote:
> > I did, actually, try expanding the vectorized operations into explicit
> > for loops, and computing the vector product / vector norm through the
> > BLAS interfaces. The explicit loops allocated less memory, but took
> > much more time. Meanwhile, the vectorized version I have gotten used
> > to writing runs incredibly fast, as the following tests indicate:
>
> I don't have a typical matrix like yours to test it, but your cosine()
> function is adapted from an algorithm for dense matrices. If your matrix
> is really sparse, you can do the computations only for the nonzero
> entries when multiplying i by itself and j by itself. I've put together
> a short function to compute the sum of squares here:
> https://gist.github.com/nalimilan/3ab9922294663b2ec979
>
> BTW, it looks like something like that is needed in Base to specialize
> sumabs2() for sparse matrices, as the current generic version is
> terribly slow in that case.
>
> It could easily be extended to compute i .* j if the nonzero entries in
> both matrices match. If not, more work is needed, but you could adapt
> the .* function from [1] to compute the sum on the fly instead of
> allocating space for a new matrix.
>
> Regards
>
> 1: https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
>
> > # Explicit for loops, slightly modified from SimilarityMetrics.jl by
> > # johnmyleswhite:
> > # https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl
> > function cosine(a::SparseMatrixCSC{Float64, Int64},
> >                 b::SparseMatrixCSC{Float64, Int64})
> >     sA, sB, sI = 0.0, 0.0, 0.0
> >     for i in 1:length(a)
> >         sA += a[i]^2
> >         sI += a[i] * b[i]
> >     end
> >     for i in 1:length(b)
> >         sB += b[i]^2
> >     end
> >     return sI / sqrt(sA * sB)
> > end
> >
> > # BLAS version
> > function cosine_blas(i::SparseMatrixCSC{Float64, Int64},
> >                      j::SparseMatrixCSC{Float64, Int64})
> >     i = full(i)
> >     j = full(j)
> >     numerator = BLAS.dot(i, j)
> >     denominator = BLAS.nrm2(i) * BLAS.nrm2(j)
> >     return numerator / denominator
> > end
> >
> > # The vectorized version remains the same, as the first post shows.
> >
> > # Test functions
> > function test_explicit_loop(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine(v, v)
> >     end
> > end
> >
> > function test_blas(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine_blas(v, v)
> >     end
> > end
> >
> > function test_vectorized(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine_vectorized(v, v)
> >     end
> > end
> >
> > test_explicit_loop(mat)
> > test_blas(mat)
> > test_vectorized(mat)
> > gc()
> > @time test_explicit_loop(mat)
> > gc()
> > @time test_blas(mat)
> > gc()
> > @time test_vectorized(mat)
> >
> > # Results
> > elapsed time: 3.772606858 seconds (6240080 bytes allocated)
> > elapsed time: 0.400972089 seconds (327520080 bytes allocated, 81.58% gc time)
> > elapsed time: 0.011236068 seconds (34560080 bytes allocated)
> >
> > On Monday, November 10, 2014 7:23:17 PM UTC+8, Milan Bouchet-Valat wrote:
> > > On Sunday, November 9, 2014 at 21:17 -0800, Todd Leo wrote:
> > > > Hi fellows,
> > > >
> > > > I'm currently working on sparse matrices and cosine similarity
> > > > computation, but my routines are running very slowly, at least not
> > > > meeting my expectations. So I wrote some test functions to dig out
> > > > the reason for the inefficiency. To my surprise, the execution
> > > > times of passing two vectors to the test function and of passing
> > > > the whole sparse matrix differ greatly; the latter is 80x faster.
> > > > I am wondering why extracting the two vectors from the matrix in
> > > > each loop is that dramatically faster, and how to avoid the
> > > > multi-GB memory allocation. Thanks, guys.
> > > >
> > > > --
> > > > BEST REGARDS,
> > > > Todd Leo
> > > >
> > > > # The sparse matrix
> > > > mat # 2000x15037 SparseMatrixCSC{Float64, Int64}
> > > >
> > > > # The two vectors, prepared in advance
> > > > v = mat'[:,1]
> > > > w = mat'[:,2]
> > > >
> > > > # Cosine similarity function
> > > > function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64},
> > > >                            j::SparseMatrixCSC{Float64, Int64})
> > > >     return sum(i .* j) / sqrt(sum(i .* i) * sum(j .* j))
> > > > end
> > >
> > > I think you'll experience a dramatic speed gain if you write the
> > > sums in explicit loops, accessing elements one by one, taking their
> > > product and adding it immediately to a counter. In your current
> > > version, the element-wise products allocate new vectors before
> > > computing the sums, which is very costly.
> > >
> > > This will also get rid of the difference you report between passing
> > > arrays and vectors.
> > >
> > > Regards
> > >
> > > > function test1(d)
> > > >     res = 0.
> > > >     for i in 1:10000
> > > >         res = cosine_vectorized(d[:,1], d[:,2])
> > > >     end
> > > > end
> > > >
> > > > function test2(_v, _w)
> > > >     res = 0.
> > > >     for i in 1:10000
> > > >         res = cosine_vectorized(_v, _w)
> > > >     end
> > > > end
> > > >
> > > > test1(mat)
> > > > test2(v, w)
> > > > gc()
> > > > @time test1(mat)
> > > > gc()
> > > > @time test2(v, w)
> > > >
> > > > # elapsed time: 0.054925372 seconds (59360080 bytes allocated, 59.07% gc time)
> > > > # elapsed time: 4.204132608 seconds (3684160080 bytes allocated, 65.51% gc time)
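P.S. For completeness, here is my understanding of the nonzeros-only sum of squares Milan suggests (my own sketch, not the exact code from the gist):

# Nonzeros-only sum of squares: only stored entries are visited, so it
# runs in O(nnz) time and allocates nothing.
function sumsq(a::SparseMatrixCSC{Float64, Int64})
    s = 0.0
    for v in a.nzval
        s += v * v
    end
    return s
end

# Combined with the merge-based sparse_dot sketched above, the whole
# cosine would need no temporaries (cosine_sparse is a placeholder name):
cosine_sparse(a, b) = sparse_dot(a, b) / sqrt(sumsq(a) * sumsq(b))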
