On Tuesday, November 11, 2014 at 18:15 -0800, Tony Kelman wrote:
> Milan,
>
> A useful trick for linking to lines of code on GitHub is to hit the
> "y" key: GitHub will reload the page you're on, but with the
> specific SHA in the URL, like so:
> https://github.com/JuliaLang/julia/blob/d0a951ccb3a7ebae7909665f4445a019f2ee54a1/base/sparse/sparsematrix.jl#L530
>
> That way the link will still make sense in the future, even if the
> contents of that file get moved around.
Right, good trick.
Regards


On Tuesday, November 11, 2014 2:57:44 AM UTC-8, Milan Bouchet-Valat wrote:
> On Monday, November 10, 2014 at 19:06 -0800, Todd Leo wrote:
> > I did, actually, try expanding the vectorized operations into
> > explicit for loops, and computing the vector multiplication and
> > vector norm through the BLAS interfaces. The explicit loops did
> > allocate less memory, but took much more time. Meanwhile, the
> > vectorized version I've gotten used to writing runs incredibly
> > fast, as the tests below indicate:
> I don't have a typical matrix like yours to test it, but your
> cosine() function is adapted from an algorithm for dense matrices.
> If your matrix is really sparse, you can do the computations only
> for the nonzero entries when multiplying i by itself and j by
> itself. I've put together a short function to compute the sum of
> squares here:
> https://gist.github.com/nalimilan/3ab9922294663b2ec979
>
> BTW, it looks like something like this is needed in Base to
> specialize sumabs2() for sparse matrices, as the current generic
> version is terribly slow in that case.
>
> It could easily be extended to compute i .* j if the nonzero
> entries in both matrices match. If not, more work is needed, but
> you could adapt the .* function from [1] to compute the sum on the
> fly instead of allocating space for a new matrix.
>
> Regards
>
> 1: https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
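Milan's gist isn't quoted in the thread, but a minimal sketch of the two ideas he describes might look as follows: a sum of squares that visits only the stored entries of the CSC structure, and an on-the-fly sum of i .* j that never allocates the product. The names sumabs2_sparse and sparsedot are made up for illustration (this is not the gist's code), and the merge pass assumes single-column, n-by-1 operands like mat[:,1]:

# A sketch of a nonzero-only sum of squares: zeros contribute nothing,
# so it suffices to loop over the stored values (the nzval field of
# the CSC structure). Explicitly stored zeros are harmless here.
function sumabs2_sparse(A::SparseMatrixCSC{Float64, Int64})
    s = 0.0
    nz = A.nzval
    for k in 1:length(nz)
        s += nz[k]^2
    end
    return s
end

# A sketch of summing i .* j on the fly for two sparse column vectors
# (n-by-1, like mat[:,1]): walk the two sorted rowval arrays in one
# merge pass. No temporary matrix is allocated.
function sparsedot(a::SparseMatrixCSC{Float64, Int64},
                   b::SparseMatrixCSC{Float64, Int64})
    ra, rb = a.rowval, b.rowval
    va, vb = a.nzval, b.nzval
    ia, ib = 1, 1
    s = 0.0
    while ia <= length(ra) && ib <= length(rb)
        if ra[ia] == rb[ib]        # entry stored in both vectors
            s += va[ia] * vb[ib]
            ia += 1
            ib += 1
        elseif ra[ia] < rb[ib]     # stored only in a: product is zero
            ia += 1
        else                       # stored only in b: product is zero
            ib += 1
        end
    end
    return s
end

Skipping the entries stored in only one operand is what makes the merge pass cheap: any product involving a structural zero is known to be zero without ever being computed.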
> > # Explicit for loop, slightly modified from SimilarityMetrics.jl
> > # by johnmyleswhite
> > # (https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl)
> > function cosine(a::SparseMatrixCSC{Float64, Int64},
> >                 b::SparseMatrixCSC{Float64, Int64})
> >     sA, sB, sI = 0.0, 0.0, 0.0
> >     for i in 1:length(a)
> >         sA += a[i]^2
> >         sI += a[i] * b[i]
> >     end
> >     for i in 1:length(b)
> >         sB += b[i]^2
> >     end
> >     return sI / sqrt(sA * sB)
> > end
> >
> > # BLAS version
> > function cosine_blas(i::SparseMatrixCSC{Float64, Int64},
> >                      j::SparseMatrixCSC{Float64, Int64})
> >     i = full(i)
> >     j = full(j)
> >     numerator = BLAS.dot(i, j)
> >     denominator = BLAS.nrm2(i) * BLAS.nrm2(j)
> >     return numerator / denominator
> > end
> >
> > # The vectorized version remains the same, as the first post shows.
> >
> > # Test functions
> > function test_explicit_loop(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine(v, v)
> >     end
> > end
> >
> > function test_blas(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine_blas(v, v)
> >     end
> > end
> >
> > function test_vectorized(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine_vectorized(v, v)
> >     end
> > end
> >
> > test_explicit_loop(mat)
> > test_blas(mat)
> > test_vectorized(mat)
> > gc()
> > @time test_explicit_loop(mat)
> > gc()
> > @time test_blas(mat)
> > gc()
> > @time test_vectorized(mat)
> >
> > # Results
> > elapsed time: 3.772606858 seconds (6240080 bytes allocated)
> > elapsed time: 0.400972089 seconds (327520080 bytes allocated, 81.58% gc time)
> > elapsed time: 0.011236068 seconds (34560080 bytes allocated)
> >
> > On Monday, November 10, 2014 7:23:17 PM UTC+8, Milan Bouchet-Valat wrote:
> > On Sunday, November 9, 2014 at 21:17 -0800, Todd Leo wrote:
> > > Hi fellows,
> > >
> > > I'm currently working on sparse matrices and cosine similarity
> > > computation, but my routines are running very slowly, at least
> > > more slowly than I expected. So I wrote some test functions to
> > > dig out the reason for the inefficiency. To my surprise, the
> > > execution times of passing two vectors to the test function and
> > > of passing the whole sparse matrix differ greatly: the latter is
> > > 80x faster. I am wondering why extracting the two vectors from
> > > the matrix in each loop is so dramatically faster, and how to
> > > avoid the multi-GB memory allocation. Thanks, guys.
> > >
> > > --
> > > BEST REGARDS,
> > > Todd Leo
> > >
> > > # The sparse matrix
> > > mat # 2000x15037 SparseMatrixCSC{Float64, Int64}
> > >
> > > # The two vectors, prepared in advance
> > > v = mat'[:,1]
> > > w = mat'[:,2]
> > >
> > > # Cosine similarity function
> > > function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64},
> > >                            j::SparseMatrixCSC{Float64, Int64})
> > >     return sum(i .* j) / sqrt(sum(i .* i) * sum(j .* j))
> > > end
> > I think you'll experience a dramatic speed gain if you write the
> > sums as explicit loops, accessing elements one by one, taking
> > their product and adding it immediately to an accumulator. In
> > your current version, the element-wise products allocate new
> > vectors before computing the sums, which is very costly.
> >
> > This will also get rid of the difference you report between
> > passing the matrix and passing the vectors.
> >
> > Regards
> >
> > > function test1(d)
> > >     res = 0.
> > >     for i in 1:10000
> > >         res = cosine_vectorized(d[:,1], d[:,2])
> > >     end
> > > end
> > >
> > > function test2(_v, _w)
> > >     res = 0.
> > >     for i in 1:10000
> > >         res = cosine_vectorized(_v, _w)
> > >     end
> > > end
> > >
> > > test1(mat)
> > > test2(v, w)
> > > gc()
> > > @time test1(mat)
> > > gc()
> > > @time test2(v, w)
> > >
> > > # elapsed time: 0.054925372 seconds (59360080 bytes allocated, 59.07% gc time)
> > > # elapsed time: 4.204132608 seconds (3684160080 bytes allocated, 65.51% gc time)
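Combining Milan's explicit-loop advice with the sparse-aware sketches above gives a cosine that accumulates every sum directly into a scalar, with no temporary vectors at all. Again, cosine_sparse is a hypothetical name, and it relies on the same n-by-1 assumption as the sketches it calls:

function cosine_sparse(a::SparseMatrixCSC{Float64, Int64},
                       b::SparseMatrixCSC{Float64, Int64})
    # Each term is a plain accumulation over stored entries, so no
    # intermediate vector like i .* j is ever allocated.
    return sparsedot(a, b) / sqrt(sumabs2_sparse(a) * sumabs2_sparse(b))
end

# Usage with the vectors from the first post:
#   v = mat'[:,1]
#   w = mat'[:,2]
#   cosine_sparse(v, w)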
