Thanks! Now the 2-norm calculation in cosine() for sparse matrices can be optimized with your sumsq() function. But is the dot product part of cosine() similarly optimizable for sparse matrices?
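Here is a rough sketch of the kind of thing I have in mind, merging the sorted rowval arrays of the two arguments (sparse_dot is just a placeholder name, and I assume single-column SparseMatrixCSC arguments as in cosine_vectorized; untested):

# Sketch of a nonzeros-only dot product for two single-column sparse
# matrices. It merges the two sorted rowval arrays, so it touches only
# stored entries -- O(nnz) time and no temporary vectors allocated.
function sparse_dot(a::SparseMatrixCSC{Float64, Int64},
                    b::SparseMatrixCSC{Float64, Int64})
    ra, va = a.rowval, a.nzval
    rb, vb = b.rowval, b.nzval
    ia, ib, s = 1, 1, 0.0
    while ia <= length(ra) && ib <= length(rb)
        if ra[ia] == rb[ib]        # row stored in both: contributes
            s += va[ia] * vb[ib]
            ia += 1; ib += 1
        elseif ra[ia] < rb[ib]     # row stored only in a: product is 0
            ia += 1
        else                       # row stored only in b: product is 0
            ib += 1
        end
    end
    return s
end

Does that look like the right approach, or is there something in Base I should use instead?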
On Tuesday, November 11, 2014 6:57:44 PM UTC+8, Milan Bouchet-Valat wrote:
> On Monday, November 10, 2014 at 19:06 -0800, Todd Leo wrote:
> > I did, actually, try expanding the vectorized operations into explicit
> > for loops, and computing the vector product / vector norm through the
> > BLAS interfaces. The explicit loops allocated less memory, but took
> > much more time. Meanwhile, the vectorized version I have gotten used
> > to writing runs incredibly fast, as the following tests indicate:
>
> I don't have a typical matrix like yours to test it, but your cosine()
> function is adapted from an algorithm for dense matrices. If your matrix
> is really sparse, you can do the computations only for the nonzero
> entries when multiplying i by itself and j by itself. I've put together
> a short function to compute the sum of squares here:
> https://gist.github.com/nalimilan/3ab9922294663b2ec979
>
> BTW, it looks like something like that is needed in Base to specialize
> sumabs2() for sparse matrices, as the current generic version is
> terribly slow in that case.
>
> It could easily be extended to compute i .* j if the nonzero entries in
> both matrices match. If not, more work is needed, but you could adapt
> the .* function from [1] to compute the sum on the fly instead of
> allocating space for a new matrix.
>
> Regards
>
> 1: https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
>
> > # Explicit for loops, slightly modified from SimilarityMetrics.jl by
> > # johnmyleswhite:
> > # https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl
> > function cosine(a::SparseMatrixCSC{Float64, Int64},
> >                 b::SparseMatrixCSC{Float64, Int64})
> >     sA, sB, sI = 0.0, 0.0, 0.0
> >     for i in 1:length(a)
> >         sA += a[i]^2
> >         sI += a[i] * b[i]
> >     end
> >     for i in 1:length(b)
> >         sB += b[i]^2
> >     end
> >     return sI / sqrt(sA * sB)
> > end
> >
> > # BLAS version
> > function cosine_blas(i::SparseMatrixCSC{Float64, Int64},
> >                      j::SparseMatrixCSC{Float64, Int64})
> >     i = full(i)
> >     j = full(j)
> >     numerator = BLAS.dot(i, j)
> >     denominator = BLAS.nrm2(i) * BLAS.nrm2(j)
> >     return numerator / denominator
> > end
> >
> > # The vectorized version remains the same, as the first post shows.
> >
> > # Test functions
> > function test_explicit_loop(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine(v, v)
> >     end
> > end
> >
> > function test_blas(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine_blas(v, v)
> >     end
> > end
> >
> > function test_vectorized(d)
> >     for n in 1:10000
> >         v = d[:,1]
> >         cosine_vectorized(v, v)
> >     end
> > end
> >
> > test_explicit_loop(mat)
> > test_blas(mat)
> > test_vectorized(mat)
> > gc()
> > @time test_explicit_loop(mat)
> > gc()
> > @time test_blas(mat)
> > gc()
> > @time test_vectorized(mat)
> >
> > # Results
> > elapsed time: 3.772606858 seconds (6240080 bytes allocated)
> > elapsed time: 0.400972089 seconds (327520080 bytes allocated, 81.58% gc time)
> > elapsed time: 0.011236068 seconds (34560080 bytes allocated)
> >
> > On Monday, November 10, 2014 7:23:17 PM UTC+8, Milan Bouchet-Valat wrote:
> > > On Sunday, November 9, 2014 at 21:17 -0800, Todd Leo wrote:
> > > > Hi fellows,
> > > >
> > > > I'm currently working on sparse matrices and cosine similarity
> > > > computation, but my routines are running very slowly, at least not
> > > > meeting my expectations. So I wrote some test functions to dig out
> > > > the reason for the inefficiency. To my surprise, the execution
> > > > times of passing two vectors to the test function and of passing
> > > > the whole sparse matrix differ greatly; the latter is 80x faster.
> > > > I am wondering why extracting the two vectors from the matrix in
> > > > each loop is that dramatically faster, and how to avoid the
> > > > multi-GB memory allocation. Thanks, guys.
> > > >
> > > > --
> > > > BEST REGARDS,
> > > > Todd Leo
> > > >
> > > > # The sparse matrix
> > > > mat # 2000x15037 SparseMatrixCSC{Float64, Int64}
> > > >
> > > > # The two vectors, prepared in advance
> > > > v = mat'[:,1]
> > > > w = mat'[:,2]
> > > >
> > > > # Cosine similarity function
> > > > function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64},
> > > >                            j::SparseMatrixCSC{Float64, Int64})
> > > >     return sum(i .* j) / sqrt(sum(i .* i) * sum(j .* j))
> > > > end
> > >
> > > I think you'll experience a dramatic speed gain if you write the
> > > sums in explicit loops, accessing elements one by one, taking their
> > > product and adding it immediately to a counter. In your current
> > > version, the element-wise products allocate new vectors before
> > > computing the sums, which is very costly.
> > >
> > > This will also get rid of the difference you report between passing
> > > arrays and vectors.
> > >
> > > Regards
> > >
> > > > function test1(d)
> > > >     res = 0.
> > > >     for i in 1:10000
> > > >         res = cosine_vectorized(d[:,1], d[:,2])
> > > >     end
> > > > end
> > > >
> > > > function test2(_v, _w)
> > > >     res = 0.
> > > >     for i in 1:10000
> > > >         res = cosine_vectorized(_v, _w)
> > > >     end
> > > > end
> > > >
> > > > test1(mat)
> > > > test2(v, w)
> > > > gc()
> > > > @time test1(mat)
> > > > gc()
> > > > @time test2(v, w)
> > > >
> > > > # elapsed time: 0.054925372 seconds (59360080 bytes allocated, 59.07% gc time)
> > > > # elapsed time: 4.204132608 seconds (3684160080 bytes allocated, 65.51% gc time)
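P.S. For completeness, here is my understanding of the nonzeros-only sum of squares Milan suggests (my own sketch, not the exact code from the gist):

# Nonzeros-only sum of squares: only stored entries are visited, so it
# runs in O(nnz) time and allocates nothing.
function sumsq(a::SparseMatrixCSC{Float64, Int64})
    s = 0.0
    for v in a.nzval
        s += v * v
    end
    return s
end

# Combined with the merge-based sparse_dot sketched above, the whole
# cosine would need no temporaries (cosine_sparse is a placeholder name):
cosine_sparse(a, b) = sparse_dot(a, b) / sqrt(sumsq(a) * sumsq(b))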
