Dňa streda, 22. októbra 2014 11:56:31 UTC+2 Tim Holy napísal(-a):
>
> On Wednesday, October 22, 2014 02:27:30 AM Ján Dolinský wrote:
> > I wonder why my vectorized code is not fast. I assume operations like
> "X' *
> > a" are using BLAS (e.g. BLAS.gemv()) and thus it should be as fast as a
> > devectorized code at least in this case.
>
> Yes, but their result requires allocating a temporary, and temporaries
> mean
> garbage collection. Try
> - pre-allocating a couple of temp vectors to store the results of your
> matrix
> multiplies, and use functions like `Ac_mul_B!(dest, A, B)` to calculate
> A'*B
> and store the result in dest.
> - Do the subtractions & divisions using a loop, as suggested by Steven.
> - in each iteration you should call `view(X, :, l)` just once and store
> the
> result to a variable---no point recreating the same view 3 times per
> iteration.
>
> --Tim
>
Hello Tim,
Thanks for the tips. I rewrote a bit the mgs routine so that it is more
readable and did profiling.
using ArrayViews
function mgs(X)
# mgs speed project
nobs, nvars = size(X)
R = eye(nvars)
for l=1:nvars-1
v_i = view(X,:,l+1:nvars) # --- line 10
s = view(X,:,l);
R[l,l+1:nvars]=(v_i'*s)/sumabs2(s) # --- line 13
X[:,l+1:nvars]=v_i-(s*view(R,l,l+1:nvars)) # --- line 14
end
X, R
end
Profilling is as follows:
1 linalg/blas.jl; gemm!; line: 526
10455 task.jl; anonymous; line: 96
10455 REPL.jl; eval_user_input; line: 54
10455 profile.jl; anonymous; line: 14
2 ...me/Projects/julia-trials/mgs.jl; mgs; line: 7
2 array.jl; eye; line: 176
2 array.jl; fill!; line: 155
2383 ...me/Projects/julia-trials/mgs.jl; mgs; line: 10
*41 ...me/Projects/julia-trials/mgs.jl; mgs; line: 13*
3 array.jl; ./; line: 756
7 array.jl; ./; line: 758
1 linalg/matmul.jl; gemv!; line: 217
1 pointer.jl; pointer; line: 29
16 multidimensional.jl; setindex!; line: 76
3 multidimensional.jl; setindex!; line: 295
6 reduce.jl; _mapreduce; line: 168
*8027 ...me/Projects/julia-trials/mgs.jl; mgs; line: 14*
31 ...3/ArrayViews/src/ArrayViews.jl; strided_view; line: 72
1 abstractarray.jl; copy!; line: 152
3109 array.jl; -; line: 719
2770 array.jl; -; line: 721
5 linalg/matmul.jl; *; line: 74
1 abstractarray.jl; copy!; line: 150
1807 multidimensional.jl; setindex!; line: 76
1 multidimensional.jl; setindex!; line: 80
303 multidimensional.jl; setindex!; line: 295
1 abstractarray.jl; stride; line: 30
There is apparently a huge difference between line 13 and 14.
Line 14 indeed deserves a loop as suggested by Steven. The culprit seems to
be the minus operation. I'll explore the problem further.
Line 13 seems to be already reasonably fast but preallocating a space for
output of (v_i'*s) and using something like "Ac_mul_B!(dest, v_i, s)" might
provide further speedup.
It is very interesting, I'll play with it a bit further and post the
improvements.
Jan