Have you tried to time the recursive algorithm without BLAS calls against
the noblas version?

I agree that would be great if it wasn't necessary to write out everything
in loops to get high performance. Hopefully, this can improve over time.


2014-05-27 21:00 GMT+02:00 Viral Shah <[email protected]>:

> This is really fascinating! There is certainly interest in supporting
> linear algebra operations for types other than the ones BLAS/LAPACK support.
>
> Do see the discussion in the issue below. I know that you do not like
> github much, but seeing the discussion should not require you to have a
> login.
> https://github.com/JuliaLang/julia/issues/3014
>
> -viral
>
>
> On Tuesday, May 27, 2014 5:06:30 AM UTC+5:30, Jason Riedy wrote:
>>
>> First, the caveats:
>>
>>    - on large matrices,
>>     - with few threads, and
>>     - punching a hole directly to the BLAS library.
>>
>>  Sivan Toledo's recursive LU factorization ( http://dx.doi.org/10.1137/
>> S0895479896297744 ) isn't just good for optimizing cache behavior but
>> also avoiding top-level overheads. A somewhat straight-forward iterative
>> implementation in Julia, attached, performs about as quickly as the current
>> LAPACK call for square matrices of dimension 3000 or above so long as you
>> limit OpenBLAS to a single thread. I haven't tested against MKL.
>>
>> Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM
>> and xTRSV directly. The allocations and construction of StridedMatrix views
>> at every step destroys any performance. Similarly with a light-weight
>> wrapper attempt called LASubDomain below. Also, I have to write loops
>> rather than vector expressions. That leads to ugly code compared to modern
>> Fortran versions.
>>
>> Current ratio of recursive LU time to the Julia LAPACK dispatch on a
>> dual-core, Sandy Bridge server by dimension (vertical) and number of
>> threads (horizontal):
>>        N NT: 1 2 4 8 16 24 32   10 2301.01 2291.16 2335.39 2268.1 2532.12
>> 2304.23 2320.84  30 661.25 560.34 443.07 514.52 392.04 425.56 472  100
>> 98.62 90.38 96.69 94.31 74.6 73.61 76.33  300 8.12 8.14 12.19 16.15 14.55
>> 14.16 15.82  1000 1.6 1.55 2.36 3.35 3.5 3.86 3.48  3000 1.04 1.13 1.36
>> 1.99 1.62 1.36 1.39  10000 1.02 1.06 1.13 1.26 1.55 1.54 1.58
>>
>> The best case for traditional LU factorization relying on xGEMM for the
>> Schur complement is 7.7x slower than the LAPACK dispatch, and it scales far
>> worse.
>>
>> If there's interest, I'd love to work this into a faster generic
>> base/linalg/lu.jl for use with other data types. I know I need to consider
>> if the laswap!, lutrisolve!, and luschur! functions are the right
>> utility methods. I'm interested in supporting doubled double, etc. with
>> relatively high performance. That implies building everything out of at
>> least dot products. For "multiplied" precisions, those can be optimized to
>> avoid continual re-normalization (see Siegfried Rump, et al.'s work).
>>
>> (I actually wrote the code last November, but this is first chance I've
>> had to describe it. Might be a tad out of date style-wise with respect to
>> current Julia. Given that lag, I suppose I shouldn't promise to beat this
>> into shape at any fast pace…)
>>
>> <#part type="text/plain" filename="~/src/julia-stuff/reclu/reclu.jl"
>> disposition=inline>
>> <#/part>
>> <#part type="text/plain" filename="~/src/julia-stuff/reclu/tst.jl"
>> disposition=inline>
>> <#/part>
>>
>


-- 
Med venlig hilsen

Andreas Noack Jensen

Reply via email to