With array views, we should be able to avoid writing loops. Hopefully we will 
tackle this early in 0.4.

-viral


On May 28, 2014, at 1:05 AM, Andreas Noack Jensen 
<[email protected]> wrote:

> Have you tried to time the recursive algorithm without BLAS calls against the 
> noblas version?
> 
> I agree that it would be great if it weren't necessary to write everything 
> out in loops to get high performance. Hopefully, this can improve over time.
> 
> 
> 2014-05-27 21:00 GMT+02:00 Viral Shah <[email protected]>:
> This is really fascinating! There is certainly interest in supporting linear 
> algebra operations for types other than the ones BLAS/LAPACK support.
> 
> Do see the discussion in the issue below. I know that you do not like github 
> much, but seeing the discussion should not require you to have a login. 
> https://github.com/JuliaLang/julia/issues/3014
> 
> -viral
> 
> 
> On Tuesday, May 27, 2014 5:06:30 AM UTC+5:30, Jason Riedy wrote:
> First, the caveats:
>   - on large matrices,
>   - with few threads, and
>   - punching a hole directly to the BLAS library.
> Sivan Toledo's recursive LU factorization ( 
> http://dx.doi.org/10.1137/S0895479896297744 ) isn't just good for optimizing 
> cache behavior but also for avoiding top-level overheads. A somewhat 
> straightforward iterative implementation in Julia, attached, performs about 
> as quickly as the current LAPACK call for square matrices of dimension 3000 
> or above, so long as you limit OpenBLAS to a single thread. I haven't tested 
> against MKL.
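> For reference, the recursive structure looks roughly like this in current 
> Julia syntax. This is only a minimal sketch of the recursive idea, without 
> pivoting, so it assumes no zero pivots arise; the attached code (and 
> Toledo's algorithm) handles pivoting and splits the panels differently.

```julia
using LinearAlgebra

# Minimal sketch of recursive LU, in place on A, without pivoting.
# After the call, the unit lower triangle of A holds L and the upper
# triangle holds U.  Assumes A is square with nonzero pivots.
function reclu!(A::AbstractMatrix)
    n = size(A, 1)
    n == 1 && return A
    k = div(n, 2)
    A11 = view(A, 1:k,   1:k)
    A12 = view(A, 1:k,   k+1:n)
    A21 = view(A, k+1:n, 1:k)
    A22 = view(A, k+1:n, k+1:n)
    reclu!(A11)                              # A11 = L11*U11
    rdiv!(A21, UpperTriangular(A11))         # A21 := A21 * U11^{-1}
    ldiv!(UnitLowerTriangular(A11), A12)     # A12 := L11^{-1} * A12
    mul!(A22, A21, A12, -1.0, 1.0)           # Schur update: A22 -= A21*A12
    reclu!(A22)                              # factor the Schur complement
    return A
end
```

> The recursion bottoms out in larger and larger xGEMM-shaped updates, which 
> is where the cache and overhead benefits come from.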
> Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM and 
> xTRSV directly. The allocations and construction of StridedMatrix views at 
> every step destroy any performance gain. The same goes for a light-weight 
> wrapper attempt called LASubDomain below. I also have to write loops rather 
> than vector expressions, which leads to ugly code compared to modern Fortran 
> versions.
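> The direct calls in question look roughly like this through Julia's BLAS 
> wrappers (gemm! and trsv! in the LinearAlgebra.BLAS module of current 
> Julia); the matrices here are made up for illustration:

```julia
using LinearAlgebra
using LinearAlgebra.BLAS: gemm!, trsv!

A = [2.0 0.0; 1.0 3.0]       # lower-triangular system matrix
b = [2.0, 5.0]
# Solve A*x = b in place: trsv!(uplo, trans, diag, A, b) overwrites b.
trsv!('L', 'N', 'N', A, b)   # b now holds x = [1.0, 4/3]

# C := -1.0*A*B + 1.0*C via gemm! -- the Schur-complement update shape.
B = ones(2, 2)
C = ones(2, 2)
gemm!('N', 'N', -1.0, A, B, 1.0, C)
```

> Calling the mutating wrappers directly avoids the per-step allocation of 
> result matrices that the high-level operators incur.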
> Current ratio of recursive LU time to the Julia LAPACK dispatch on a 
> dual-core, Sandy Bridge server by dimension (vertical) and number of threads 
> (horizontal):
> N     NT: 1   2       4       8       16      24      32
> 10    2301.01 2291.16 2335.39 2268.1  2532.12 2304.23 2320.84
> 30    661.25  560.34  443.07  514.52  392.04  425.56  472
> 100   98.62   90.38   96.69   94.31   74.6    73.61   76.33
> 300   8.12    8.14    12.19   16.15   14.55   14.16   15.82
> 1000  1.6     1.55    2.36    3.35    3.5     3.86    3.48
> 3000  1.04    1.13    1.36    1.99    1.62    1.36    1.39
> 10000 1.02    1.06    1.13    1.26    1.55    1.54    1.58
> The best case for traditional LU factorization relying on xGEMM for the Schur 
> complement is 7.7x slower than the LAPACK dispatch, and it scales far worse.
> If there's interest, I'd love to work this into a faster, generic 
> base/linalg/lu.jl for use with other data types. I know I need to consider 
> whether the laswap!, lutrisolve!, and luschur! functions are the right 
> utility methods. I'm interested in supporting doubled-double arithmetic, 
> etc., with relatively high performance. That implies building everything out 
> of at least dot products. For "multiplied" precisions, those can be 
> optimized to avoid continual re-normalization (see the work of Siegfried 
> Rump et al.).
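> The dot-product building block referred to here is, in the spirit of Ogita, 
> Rump & Oishi's compensated dot product, something like the sketch below. 
> The names twosum, twoprod, and dot2 are my own; the error-free 
> transformations themselves are the standard ones:

```julia
# TwoSum (Knuth): s + e == a + b exactly, with s = fl(a + b).
function twosum(a, b)
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e
end

# TwoProd: p + e == a * b exactly, using a fused multiply-add.
function twoprod(a, b)
    p = a * b
    e = fma(a, b, -p)
    return p, e
end

# Compensated dot product: accumulates the rounding errors separately,
# giving roughly twice-working-precision accuracy.
function dot2(x, y)
    s, c = 0.0, 0.0
    for i in eachindex(x)
        p, ep = twoprod(x[i], y[i])
        s, es = twosum(s, p)
        c += ep + es
    end
    return s + c
end
```

> The re-normalization savings come from carrying the error terms along and 
> folding them in only once at the end, rather than after every operation.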
> (I actually wrote the code last November, but this is the first chance I've 
> had to describe it. It might be a tad out of date style-wise with respect to 
> current Julia. Given that lag, I suppose I shouldn't promise to beat this 
> into shape at any fast pace…)
> [Attachments, inline as text/plain: ~/src/julia-stuff/reclu/reclu.jl and 
> ~/src/julia-stuff/reclu/tst.jl]
> 
> 
> 
> -- 
> Med venlig hilsen
> 
> Andreas Noack Jensen
