On Wednesday, September 3, 2014 6:46:07 PM UTC-5, Giuseppe Ragusa wrote:
>
> I have been struggling to find a fast and elegant way to implement the
> following algorithm.
>
[...]
> function block_crossprod!(out, X, cl)
> for j in distinct(cl)
> out += X[find(cl.==j),:]'*X[find(cl.==j),:]
> end
> end
>
Hi Giuseppe, hope you're doing well! I think that "+=" creates a
temporary array, even though the notation suggests that it's
mutating. The BLAS function "syrk!" does essentially what you
want but without creating the temporary variables[1]. i.e.:
```
function block_crossprod2!(out, X, cl)
for j in unique(cl)
Base.BLAS.syrk!('U', 'T', 1., X[cl .== j,:], 1., out)
end
end
```
out will only return the upper triangular part, but
Symmetric(out) will make it symmetric.
If you know more about the indexing scheme, you may be able to
improve things more by replacing X[cl .== j,:] with a subarray:
```
function block_crossprod3!(out, X, bstarts, blengths)
for j in 1:length(bstarts)
Base.BLAS.syrk!('U', 'T', 1.,
sub(X, range(bstarts[j], blengths[j]),:), 1., out)
end
end
```
Other people may have more improvements, or know
how to use subarrays in the general case. Hope this helps!
A quick profile:
```
X = randn(10_000,200)
cl=rep(1:500, 20)
out1 = zeros(200,200)
bstarts = 1:20:10_000
blengths = rep(20, 500)
@time block_crossprod!(out1, X, cl)
# elapsed time: 0.611584694 seconds (358190568 bytes allocated, 41.52% gc
time)
@time block_crossprod2!(out1, X, cl)
# elapsed time: 0.179214852 seconds (19066568 bytes allocated, 20.01% gc
time)
@time block_crossprod3!(out1, X, bstarts, blengths)
# elapsed time: 0.095740838 seconds (390416 bytes allocated)
```
The memory allocation for the last version is kind of astonishing.
[1]:
http://julia.readthedocs.org/en/latest/stdlib/linalg/#Base.LinAlg.BLAS.syrk!