On Wednesday, October 22, 2014 02:27:30 AM Ján Dolinský wrote:
> I wonder why my vectorized code is not fast. I assume operations like "X' * 
> a" are using BLAS (e.g. BLAS.gemv()) and thus it should be as fast as a
> devectorized code at least in this case.

Yes, but each such expression has to allocate a temporary for its result, and 
temporaries mean garbage collection. Try:
- Pre-allocate a couple of temp vectors to store the results of your matrix 
multiplies, and use functions like `Ac_mul_B!(dest, A, B)` to calculate A'*B 
and store the result in dest. 
- Do the subtractions & divisions using a loop, as suggested by Steven. 
- In each iteration, call `view(X, :, l)` just once and store the result in a 
variable; there is no point recreating the same view 3 times per iteration.
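
A minimal sketch of how the three suggestions might fit together. This is not 
your actual code: the function `deflate!`, the names `tmp` and `c`, and the 
computation itself are hypothetical and only illustrate the pattern. It is 
written for Julia 1.x, where the in-place product that `Ac_mul_B!(dest, A, B)` 
performed is spelled `mul!(dest, A', B)` from the `LinearAlgebra` standard 
library.

```julia
using LinearAlgebra  # provides mul! and dot

# Hypothetical example: for each column of X, remove its component from a.
# tmp is a caller-supplied buffer of length size(X, 2).
function deflate!(a, X, tmp)
    for l in 1:size(X, 2)
        x = view(X, :, l)        # build the view once, reuse it below
        mul!(tmp, X', a)         # tmp = X' * a, written into the pre-allocated tmp
        c = tmp[l] / dot(x, x)   # scalar coefficient for column l
        for i in eachindex(a)    # explicit loop instead of a temporary for a - c*x
            a[i] -= c * x[i]
        end
    end
    return a
end

X = rand(100, 20)
a = rand(100)
tmp = zeros(size(X, 2))          # allocated once, outside the loop
deflate!(a, X, tmp)
```

The point of the design is that `tmp` is created once and overwritten in every 
iteration, so the loop body itself should not need to allocate new vectors.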

--Tim
