With array views, we should be able to avoid writing loops. Hopefully we will tackle this early in 0.4.
-viral

On May 28, 2014, at 1:05 AM, Andreas Noack Jensen <[email protected]> wrote:

> Have you tried to time the recursive algorithm without BLAS calls against
> the noblas version?
>
> I agree that it would be great if it weren't necessary to write everything
> out in loops to get high performance. Hopefully, this can improve over time.
>
> 2014-05-27 21:00 GMT+02:00 Viral Shah <[email protected]>:
> This is really fascinating! There is certainly interest in supporting linear
> algebra operations for types other than the ones BLAS/LAPACK support.
>
> Do see the discussion in the issue below. I know that you do not like GitHub
> much, but seeing the discussion should not require you to have a login.
> https://github.com/JuliaLang/julia/issues/3014
>
> -viral
>
> On Tuesday, May 27, 2014 5:06:30 AM UTC+5:30, Jason Riedy wrote:
> First, the caveats: this holds on large matrices, with few threads, and when
> punching a hole directly to the BLAS library.
>
> Sivan Toledo's recursive LU factorization
> (http://dx.doi.org/10.1137/S0895479896297744) isn't just good for optimizing
> cache behavior but also for avoiding top-level overheads. A fairly
> straightforward iterative implementation in Julia, attached, performs about
> as quickly as the current LAPACK call for square matrices of dimension 3000
> or above, so long as you limit OpenBLAS to a single thread. I haven't tested
> against MKL.
>
> Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM and
> xTRSV directly. The allocations and construction of StridedMatrix views at
> every step destroy any performance, and similarly with a light-weight
> wrapper attempt called LASubDomain below. Also, I have to write loops rather
> than vector expressions, which leads to ugly code compared to modern Fortran
> versions.
> Current ratio of recursive LU time to the Julia LAPACK dispatch on a
> dual-core, Sandy Bridge server, by dimension (vertical) and number of
> threads (horizontal):
>
>     N \ NT       1        2        4        8       16       24       32
>     10     2301.01  2291.16  2335.39  2268.1  2532.12  2304.23  2320.84
>     30      661.25   560.34   443.07   514.52   392.04   425.56   472
>     100      98.62    90.38    96.69    94.31    74.6     73.61    76.33
>     300       8.12     8.14    12.19    16.15    14.55    14.16    15.82
>     1000      1.6      1.55     2.36     3.35     3.5      3.86     3.48
>     3000      1.04     1.13     1.36     1.99     1.62     1.36     1.39
>     10000     1.02     1.06     1.13     1.26     1.55     1.54     1.58
>
> The best case for traditional LU factorization relying on xGEMM for the
> Schur complement is 7.7x slower than the LAPACK dispatch, and it scales far
> worse.
>
> If there's interest, I'd love to work this into a faster generic
> base/linalg/lu.jl for use with other data types. I know I need to consider
> whether laswap!, lutrisolve!, and luschur! are the right utility methods.
> I'm interested in supporting doubled double, etc. with relatively high
> performance. That implies building everything out of at least dot products.
> For "multiplied" precisions, those can be optimized to avoid continual
> re-normalization (see Siegfried Rump et al.'s work).
>
> (I actually wrote the code last November, but this is the first chance I've
> had to describe it. It might be a tad out of date style-wise with respect to
> current Julia. Given that lag, I suppose I shouldn't promise to beat this
> into shape at any fast pace…)
>
> <#part type="text/plain" filename="~/src/julia-stuff/reclu/reclu.jl" disposition=inline>
> <#/part>
> <#part type="text/plain" filename="~/src/julia-stuff/reclu/tst.jl" disposition=inline>
> <#/part>
>
> --
> Kind regards
>
> Andreas Noack Jensen
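[Editor's note: the attachments aren't reproduced above, so here is an independent, minimal sketch of the recursive splitting Toledo's paper describes, in present-day Julia. The name `reclu!` and the layout are this sketch's own, not the attached reclu.jl; it leans on `ldiv!`/`mul!` over views rather than direct xTRSV/xGEMM calls, so it illustrates the algorithm, not the tuned implementation the post benchmarks.]

```julia
using LinearAlgebra

# Hypothetical sketch of recursive LU with partial pivoting (NOT the
# attached reclu.jl). Factor the left half of the block columns, solve
# the unit lower-triangular system for U12, subtract the Schur
# complement from A22, then recurse on the trailing columns. Row swaps
# are applied across the whole matrix, LAPACK-style.
function reclu!(A::AbstractMatrix{Float64}, piv::Vector{Int},
                j0::Int = 1, j1::Int = size(A, 2))
    m = size(A, 1)
    n = j1 - j0 + 1
    if n == 1
        # Base case: pick the pivot in column j0, swap rows, scale below.
        p = j0 - 1 + argmax(abs.(view(A, j0:m, j0)))
        piv[j0] = p
        if p != j0
            A[j0, :], A[p, :] = A[p, :], A[j0, :]
        end
        for i in (j0 + 1):m
            A[i, j0] /= A[j0, j0]
        end
        return A
    end
    jm = j0 + n ÷ 2 - 1
    reclu!(A, piv, j0, jm)                    # factor the left half
    L11 = UnitLowerTriangular(view(A, j0:jm, j0:jm))
    ldiv!(L11, view(A, j0:jm, (jm + 1):j1))   # U12 = L11 \ A12
    mul!(view(A, (jm + 1):m, (jm + 1):j1),    # A22 -= L21 * U12
         view(A, (jm + 1):m, j0:jm),
         view(A, j0:jm, (jm + 1):j1), -1.0, 1.0)
    reclu!(A, piv, jm + 1, j1)                # recurse on the right half
    return A
end
```

The `ldiv!` and five-argument `mul!` calls dispatch to TRSM/GEMM for `Float64`, which is what makes the recursion cheap at the top levels; in 2014 the equivalent view-based calls allocated wrappers at every step, which is exactly the overhead the post complains about.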

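[Editor's note: on the doubled-precision point, the delayed-renormalization idea the post attributes to Rump et al. can be illustrated with Ogita, Rump & Oishi's compensated dot product (often called Dot2). This is a hypothetical sketch, not anything from the attachments: error-free transformations accumulate a single correction term instead of renormalizing at every step.]

```julia
# Knuth's TwoSum: s + e == a + b exactly, with s the rounded sum.
function twosum(a::Float64, b::Float64)
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e
end

# TwoProd via fused multiply-add: p + e == a * b exactly.
twoprod(a::Float64, b::Float64) = (p = a * b; (p, fma(a, b, -p)))

# Compensated dot product: carry one running correction term c and
# fold it in once at the end, rather than renormalizing per element.
function dot2(x::AbstractVector{Float64}, y::AbstractVector{Float64})
    s, c = 0.0, 0.0
    @inbounds for i in eachindex(x, y)
        p, ep = twoprod(x[i], y[i])   # exact product split
        s, es = twosum(s, p)          # exact sum split
        c += ep + es                  # accumulate the rounding errors
    end
    return s + c
end
```

For example, `dot2([1.0, 1e16, -1e16, 1.0], ones(4))` recovers 2.0, where a naive left-to-right accumulation loses one of the unit terms to rounding at 1e16.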