Have you tried to time the recursive algorithm without BLAS calls against the noblas version?
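For concreteness, a minimal timing harness along those lines could look like the sketch below (current Julia syntax; lu! is the LAPACK-backed path, while recursive_lu_noblas! is only a hypothetical stand-in for the variant in question, not a function from the attached code):

using LinearAlgebra

BLAS.set_num_threads(1)   # pin OpenBLAS to one thread, matching the comparison below
A = randn(3000, 3000)

lu!(copy(A))              # warm-up run so JIT compilation isn't timed
t_lapack = @elapsed lu!(copy(A))

# recursive_lu_noblas! is hypothetical; substitute whichever variant is being timed.
# t_rec = @elapsed recursive_lu_noblas!(copy(A))
# @show t_rec / t_lapack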
I agree that it would be great if it weren't necessary to write everything out as loops to get high performance. Hopefully, this can improve over time.

2014-05-27 21:00 GMT+02:00 Viral Shah <[email protected]>:

> This is really fascinating! There is certainly interest in supporting
> linear algebra operations for types other than the ones BLAS/LAPACK
> support.
>
> Do see the discussion in the issue below. I know that you do not like
> GitHub much, but seeing the discussion should not require you to have a
> login.
> https://github.com/JuliaLang/julia/issues/3014
>
> -viral
>
> On Tuesday, May 27, 2014 5:06:30 AM UTC+5:30, Jason Riedy wrote:
>>
>> First, the caveats:
>>
>>  - on large matrices,
>>  - with few threads, and
>>  - punching a hole directly to the BLAS library.
>>
>> Sivan Toledo's recursive LU factorization
>> (http://dx.doi.org/10.1137/S0895479896297744) isn't just good for
>> optimizing cache behavior but also for avoiding top-level overheads. A
>> fairly straightforward iterative implementation in Julia, attached,
>> performs about as quickly as the current LAPACK call for square
>> matrices of dimension 3000 or above, so long as you limit OpenBLAS to a
>> single thread. I haven't tested against MKL.
>>
>> Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM
>> and xTRSV directly. The allocation and construction of StridedMatrix
>> views at every step destroy any performance, and likewise for a
>> lightweight wrapper attempt called LASubDomain below. I also have to
>> write loops rather than vector expressions, which leads to ugly code
>> compared to modern Fortran versions.
>>
>> Current ratio of recursive LU time to the Julia LAPACK dispatch on a
>> dual-core Sandy Bridge server, by dimension N (vertical) and number of
>> threads NT (horizontal):
>>
>>      N \ NT       1        2        4        8       16       24       32
>>     10      2301.01  2291.16  2335.39  2268.1   2532.12  2304.23  2320.84
>>     30       661.25   560.34   443.07   514.52   392.04   425.56   472
>>    100        98.62    90.38    96.69    94.31    74.6     73.61    76.33
>>    300         8.12     8.14    12.19    16.15    14.55    14.16    15.82
>>   1000         1.6      1.55     2.36     3.35     3.5      3.86     3.48
>>   3000         1.04     1.13     1.36     1.99     1.62     1.36     1.39
>>  10000         1.02     1.06     1.13     1.26     1.55     1.54     1.58
>>
>> The best case for a traditional LU factorization relying on xGEMM for
>> the Schur complement is 7.7x slower than the LAPACK dispatch, and it
>> scales far worse.
>>
>> If there's interest, I'd love to work this into a faster generic
>> base/linalg/lu.jl for use with other data types. I know I need to
>> consider whether laswap!, lutrisolve!, and luschur! are the right
>> utility methods. I'm interested in supporting double-double arithmetic,
>> etc. with relatively high performance. That implies building everything
>> out of at least dot products. For "multiplied" precisions, those can be
>> optimized to avoid continual re-normalization (see Siegfried Rump et
>> al.'s work).
>>
>> (I actually wrote the code last November, but this is the first chance
>> I've had to describe it. It might be a tad out of date style-wise with
>> respect to current Julia. Given that lag, I suppose I shouldn't promise
>> to beat this into shape at any fast pace…)
>>
>> [Attachments: reclu.jl and tst.jl]

--
Best regards,
Andreas Noack Jensen
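To make the "building everything out of at least dot products / explicit loops" idea concrete, here is a minimal generic LU sketch in current Julia syntax. The name genlu! and all details here are assumptions of mine, not the attached reclu.jl; the point is only that any element type supporting +, *, /, abs, and iszero works (BigFloat, Rational, a double-double type, ...):

using LinearAlgebra: SingularException

# Unblocked LU with partial pivoting, written as explicit loops so it works
# for element types BLAS/LAPACK do not cover. Overwrites A with L (unit
# lower triangle) and U (upper triangle); piv records the row permutation.
function genlu!(A::AbstractMatrix{T}) where {T}
    n = size(A, 1)
    piv = collect(1:n)
    for k in 1:n
        # Partial pivoting: find the largest magnitude in column k.
        p = k
        for i in k+1:n
            abs(A[i, k]) > abs(A[p, k]) && (p = i)
        end
        iszero(A[p, k]) && throw(SingularException(k))
        if p != k
            piv[k], piv[p] = piv[p], piv[k]
            for j in 1:n
                A[k, j], A[p, j] = A[p, j], A[k, j]
            end
        end
        # Scale the subdiagonal entries of column k by the pivot.
        for i in k+1:n
            A[i, k] /= A[k, k]
        end
        # Rank-1 Schur complement update; inner loop over i is column-major friendly.
        for j in k+1:n
            Akj = A[k, j]
            for i in k+1:n
                A[i, j] -= A[i, k] * Akj
            end
        end
    end
    return A, piv
end

After genlu!(B) with B a copy of M, tril(B, -1) + I is L, triu(B) is U, and M[piv, :] ≈ L * U. A recursive or blocked version in Toledo's style would carve the matrix up and push this update into larger dot-product-rich kernels, which is where the re-normalization savings for "multiplied" precisions would enter.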
