On Monday, November 10, 2014 at 19:06 -0800, Todd Leo wrote:
> I did, actually, try expanding the vectorized operations into
> explicit for loops, and computing the vector product / vector norm
> via the BLAS interfaces. The explicit loops did allocate less
> memory, but took much more time. Meanwhile, the vectorized version
> I've gotten used to writing runs incredibly fast, as the following
> tests indicate:
I don't have a typical matrix like yours to test it, but your cosine() function is adapted from an algorithm for dense matrices. If your matrix is really sparse, you can do the computations only for the nonzero entries when multiplying i by itself and j by itself. I've put together a short function to compute the sum of squares here:
https://gist.github.com/nalimilan/3ab9922294663b2ec979
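To give the idea (a minimal sketch, not the exact code from the gist), the sum of squares only needs to touch the stored entries:

function sumabs2_sparse(A::SparseMatrixCSC{Float64, Int64})
    s = 0.0
    for x in nonzeros(A)  # iterate over stored entries only
        s += x * x
    end
    return s
end

Since zero entries contribute nothing to the sum, skipping them gives the same result while doing nnz(A) operations instead of length(A).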
BTW, it looks like something like that is needed in Base to specialize sumabs2() for sparse matrices, as the current generic version is terribly slow in that case. It could easily be extended to compute i .* j if the nonzero entries in both matrices match. If not, more work is needed, but you could adapt the .* function from [1] to compute the sum on the fly instead of allocating space for a new matrix (see the sketch at the bottom of this message).

Regards

1: https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530

> # Explicit for loop, slightly modified from SimilarityMetrics.jl by
> # johnmyleswhite
> # (https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl)
> function cosine(a::SparseMatrixCSC{Float64, Int64},
>                 b::SparseMatrixCSC{Float64, Int64})
>     sA, sB, sI = 0.0, 0.0, 0.0
>     for i in 1:length(a)
>         sA += a[i]^2
>         sI += a[i] * b[i]
>     end
>     for i in 1:length(b)
>         sB += b[i]^2
>     end
>     return sI / sqrt(sA * sB)
> end
>
> # BLAS version
> function cosine_blas(i::SparseMatrixCSC{Float64, Int64},
>                      j::SparseMatrixCSC{Float64, Int64})
>     i = full(i)
>     j = full(j)
>     numerator = BLAS.dot(i, j)
>     denominator = BLAS.nrm2(i) * BLAS.nrm2(j)
>     return numerator / denominator
> end
>
> # The vectorized version remains the same as in the first post.
>
> # Test functions
> function test_explicit_loop(d)
>     for n in 1:10000
>         v = d[:,1]
>         cosine(v, v)
>     end
> end
>
> function test_blas(d)
>     for n in 1:10000
>         v = d[:,1]
>         cosine_blas(v, v)
>     end
> end
>
> function test_vectorized(d)
>     for n in 1:10000
>         v = d[:,1]
>         cosine_vectorized(v, v)
>     end
> end
>
> test_explicit_loop(mat)
> test_blas(mat)
> test_vectorized(mat)
> gc()
> @time test_explicit_loop(mat)
> gc()
> @time test_blas(mat)
> gc()
> @time test_vectorized(mat)
>
> # Results
> elapsed time: 3.772606858 seconds (6240080 bytes allocated)
> elapsed time: 0.400972089 seconds (327520080 bytes allocated, 81.58% gc time)
> elapsed time: 0.011236068 seconds (34560080 bytes allocated)
>
> On Monday, November 10, 2014 7:23:17 PM UTC+8, Milan Bouchet-Valat wrote:
> > On Sunday, November 9, 2014 at 21:17 -0800, Todd Leo wrote:
> > > Hi fellows,
> > >
> > > I'm currently working on sparse matrices and cosine similarity
> > > computation, but my routines are running very slowly, at least
> > > slower than I expected. So I wrote some test functions to dig
> > > out the reason for the inefficiency. To my surprise, the
> > > execution times of passing two vectors to the test function and
> > > of passing the whole sparse matrix differ greatly: the latter is
> > > 80x faster. I am wondering why extracting the two vectors from
> > > the matrix in each loop is dramatically faster, and how to
> > > avoid the multi-GB memory allocation. Thanks guys.
> > >
> > > --
> > > BEST REGARDS,
> > > Todd Leo
> > >
> > > # The sparse matrix
> > > mat # 2000x15037 SparseMatrixCSC{Float64, Int64}
> > >
> > > # The two vectors, prepared in advance
> > > v = mat'[:,1]
> > > w = mat'[:,2]
> > >
> > > # Cosine similarity function
> > > function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64},
> > >                            j::SparseMatrixCSC{Float64, Int64})
> > >     return sum(i .* j) / sqrt(sum(i .* i) * sum(j .* j))
> > > end
> > I think you'll experience a dramatic speed gain if you write the
> > sums as explicit loops, accessing elements one by one, taking
> > their product and adding it immediately to an accumulator. In
> > your current version, the element-wise products allocate new
> > vectors before computing the sums, which is very costly.
> >
> > This will also get rid of the difference you report between
> > passing arrays and vectors.
> > Regards
> >
> > > function test1(d)
> > >     res = 0.
> > >     for i in 1:10000
> > >         res = cosine_vectorized(d[:,1], d[:,2])
> > >     end
> > > end
> > >
> > > function test2(_v, _w)
> > >     res = 0.
> > >     for i in 1:10000
> > >         res = cosine_vectorized(_v, _w)
> > >     end
> > > end
> > >
> > > test1(mat)
> > > test2(v, w)
> > > gc()
> > > @time test1(mat)
> > > gc()
> > > @time test2(v, w)
> > >
> > > # elapsed time: 0.054925372 seconds (59360080 bytes allocated, 59.07% gc time)
> > > # elapsed time: 4.204132608 seconds (3684160080 bytes allocated, 65.51% gc time)
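P.S. About computing sum(i .* j) on the fly when the nonzero patterns don't match: it amounts to merging the two sorted row-index lists. A rough, untested sketch, assuming i and j are one-column sparse matrices as in your tests (sparse_dot is just my name for it):

function sparse_dot(a::SparseMatrixCSC{Float64, Int64},
                    b::SparseMatrixCSC{Float64, Int64})
    # Walk both sorted row-index lists in lockstep; products are
    # accumulated only where both vectors have a stored entry.
    ra, va = a.rowval, a.nzval
    rb, vb = b.rowval, b.nzval
    s = 0.0
    ia, ib = 1, 1
    while ia <= length(ra) && ib <= length(rb)
        if ra[ia] == rb[ib]
            s += va[ia] * vb[ib]
            ia += 1
            ib += 1
        elseif ra[ia] < rb[ib]
            ia += 1  # entry only in a: contributes zero
        else
            ib += 1  # entry only in b: contributes zero
        end
    end
    return s
end

Combined with the sum of squares above, this gives a cosine() that allocates nothing and only visits stored entries.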
