>
> I'd like to approach this speed in the end.

I don't think it is possible in Julia right now without dirty tricks such
as passing pointers around. You'd like to get the speed of BLAS by
operating on panels of your matrix, while avoiding the copying and
reallocation of arrays. If you devectorised your code completely, there'd
be no allocation, but you'd lose the fast multiplications in BLAS. You can
get part of the way by using one of the array-view packages, but the
garbage collector will charge a fee for each view you produce, so LAPACK
speed is not attainable.
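
To make the trade-off concrete, here is a minimal, untested sketch of the
middle ground (the array names and sizes are mine): an in-place
`BLAS.gemv!` on views, which avoids the temporary for the product but
still pays the per-view fee:

using ArrayViews

X = randn(100, 10)
s = view(X, :, 1)        # contiguous column view, no copy
y = Array(Float64, 9)    # preallocated result buffer

# y = X[:,2:10]'*s with no temporary for the transpose or the product;
# assumes the BLAS wrapper accepts these (contiguous) view types
BLAS.gemv!('T', 1.0, view(X, :, 2:10), s, 0.0, y)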

Kind regards

Andreas Noack

2014-10-22 10:09 GMT-04:00 Ján Dolinský <[email protected]>:

>
>
> On Wednesday, 22 October 2014 at 11:56:31 UTC+2, Tim Holy wrote:
>
>> On Wednesday, October 22, 2014 02:27:30 AM Ján Dolinský wrote:
>> > I wonder why my vectorized code is not fast. I assume operations like
>> > "X' * a" use BLAS (e.g. BLAS.gemv()), so it should be at least as fast
>> > as devectorized code in this case.
>>
>> Yes, but their result requires allocating a temporary, and temporaries
>> mean garbage collection. Try:
>> - pre-allocating a couple of temp vectors to store the results of your
>> matrix multiplies, and using functions like `Ac_mul_B!(dest, A, B)` to
>> calculate A'*B and store the result in dest (see the sketch below);
>> - doing the subtractions & divisions in a loop, as suggested by Steven;
>> - calling `view(X, :, l)` just once per iteration and storing the result
>> in a variable; there's no point recreating the same view 3 times per
>> iteration.
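>>
>> A rough sketch of the preallocation pattern (minimal and untested;
>> `dest`, `A`, and `B` are placeholder names, not anything from your code):
>>
>> A = randn(1000, 50); B = randn(1000, 50)
>> dest = Array(Float64, size(A, 2), size(B, 2))  # allocated once, up front
>> for iter = 1:100
>>     Ac_mul_B!(dest, A, B)  # dest = A'*B, reusing the same buffer each time
>> end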
>>
>> --Tim
>>
>
> Hello Tim,
>
> Thanks for the tips. I rewrote the mgs routine a bit to make it more
> readable and profiled it.
>
> using ArrayViews
>
> function mgs(X)
> # mgs speed project
>
>   nobs, nvars = size(X)
>   R = eye(nvars)
>
>   for l=1:nvars-1
>     v_i = view(X,:,l+1:nvars) # --- line 10
>     s = view(X,:,l)
>
>     R[l,l+1:nvars] = (v_i'*s)/sumabs2(s) # --- line 13
>     X[:,l+1:nvars] = v_i - s*view(R,l,l+1:nvars) # --- line 14
>   end
>
>   X, R
>
> end
>
> Profiling is as follows:
> 1     linalg/blas.jl; gemm!; line: 526
> 10455 task.jl; anonymous; line: 96
>  10455 REPL.jl; eval_user_input; line: 54
>   10455 profile.jl; anonymous; line: 14
>    2    ...me/Projects/julia-trials/mgs.jl; mgs; line: 7
>     2 array.jl; eye; line: 176
>      2 array.jl; fill!; line: 155
>    2383 ...me/Projects/julia-trials/mgs.jl; mgs; line: 10
>    *41   ...me/Projects/julia-trials/mgs.jl; mgs; line: 13*
>     3  array.jl; ./; line: 756
>     7  array.jl; ./; line: 758
>     1  linalg/matmul.jl; gemv!; line: 217
>      1 pointer.jl; pointer; line: 29
>     16 multidimensional.jl; setindex!; line: 76
>     3  multidimensional.jl; setindex!; line: 295
>     6  reduce.jl; _mapreduce; line: 168
>    *8027 ...me/Projects/julia-trials/mgs.jl; mgs; line: 14*
>     31   ...3/ArrayViews/src/ArrayViews.jl; strided_view; line: 72
>     1    abstractarray.jl; copy!; line: 152
>     3109 array.jl; -; line: 719
>     2770 array.jl; -; line: 721
>     5    linalg/matmul.jl; *; line: 74
>      1 abstractarray.jl; copy!; line: 150
>     1807 multidimensional.jl; setindex!; line: 76
>     1    multidimensional.jl; setindex!; line: 80
>     303  multidimensional.jl; setindex!; line: 295
>    1    abstractarray.jl; stride; line: 30
>
> There is apparently a huge difference between lines 13 and 14.
> Line 14 indeed deserves a loop, as suggested by Steven. The culprit seems
> to be the minus operation; I'll explore the problem further.
> Line 13 seems reasonably fast already, but preallocating space for the
> output of (v_i'*s) and using something like "Ac_mul_B!(dest, v_i, s)"
> might provide a further speedup.
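>
> For line 14, something like this devectorized loop should avoid the
> temporaries entirely (an untested sketch using the names from my function
> above; it updates X in place):
>
> for j = l+1:nvars
>   Rlj = R[l,j]            # hoist the scalar out of the inner loop
>   for i = 1:nobs
>     X[i,j] -= X[i,l]*Rlj  # line 14 column by column: X[:,j] -= s*R[l,j]
>   end
> end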
>
>
> This is very interesting; I'll play with it a bit further and post the
> improvements.
>
> Jan
>
