On Wednesday, October 22, 2014 02:27:30 AM Ján Dolinský wrote:
> I wonder why my vectorized code is not fast. I assume operations like "X' *
> a" are using BLAS (e.g. BLAS.gemv()) and thus it should be as fast as a
> devectorized code at least in this case.

Yes, but their result requires allocating a temporary, and temporaries mean garbage collection. Try:

- Pre-allocate a couple of temporary vectors to hold the results of your matrix multiplies, and use functions like `Ac_mul_B!(dest, A, B)` to calculate A'*B and store the result in dest.
- Do the subtractions and divisions in a loop, as suggested by Steven.
- In each iteration, call `view(X, :, l)` just once and store the result in a variable; there is no point recreating the same view 3 times per iteration.

--Tim
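
A minimal sketch of the three suggestions above, assuming a recent Julia (1.x), where `Ac_mul_B!(dest, A, B)` has been replaced by `mul!(dest, A', B)` from `LinearAlgebra` and `view` is part of Base. The function name `update!`, its arguments, and the arithmetic in the inner loop are hypothetical stand-ins, since the original loop is not shown in this message; only the structure (pre-allocated buffer, in-place multiply, explicit loop, one view per iteration) is the point.

```julia
using LinearAlgebra  # provides mul! and dot

# Hypothetical kernel: the arithmetic is a placeholder for the poster's own update.
# X is m-by-n; a and tmp both have length n. tmp is allocated once by the caller.
function update!(a::Vector{Float64}, X::Matrix{Float64}, tmp::Vector{Float64})
    for l in 1:size(X, 2)
        xl = view(X, :, l)      # create the view once per iteration and reuse it
        mul!(tmp, X', xl)       # tmp = X' * xl, written into the pre-allocated buffer
        d = dot(xl, xl)         # assumed nonzero in this sketch
        @inbounds for i in eachindex(a)
            a[i] = (a[i] - tmp[i]) / d   # subtraction and division in a plain loop
        end
    end
    return a
end
```

It might be called as `update!(a, X, zeros(size(X, 2)))`, reusing the same `tmp` across repeated calls so that the hot loop does not allocate a new result vector for each multiply.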
