On Tuesday, February 11, 2014 10:35:35 AM UTC-5, Tim Holy wrote:
> Jutho, you may want to check out generic_matmatmul deep inside Base; it
> implements a cache-friendly multiplication algorithm. It's possible that
> you'll find it's faster yet.
generic_matmatmul only does one level of blocking, for a guessed L1 size, and is not particularly sophisticated about it, so it's still nowhere close to competitive with an optimized BLAS.

In my experience, you can get good usage of the L1 and L2 caches with a simple recursive cache-oblivious algorithm (given a large enough base case to amortize the recursion overhead). The hard part, and the reason that an optimized BLAS requires tens of thousands of lines of code, is the lowest level: getting optimal usage of the registers and the instruction pipeline. This requires lots of optimized straight-line/unrolled code for many small matrix-multiply kernels (on the order of 10x10, although the optimal register block is actually non-square because of the difference between load and store latencies), combined with the necessity of using SIMD instructions at this level.
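To make the cache-oblivious idea concrete, here is a minimal sketch (mine, not the Base implementation): recursively split the largest of the three dimensions in half until the subproblem fits in cache, then fall back to plain triple loops. The function name `matmul_rec!` and the base-case size of 32 are assumptions for illustration; in practice the base case would be tuned, and a real BLAS would replace the inner loops with an unrolled SIMD register-blocked kernel.

```julia
# Cache-oblivious recursive matrix multiply (illustrative sketch).
# Computes C[i0:i1, j0:j1] += A[i0:i1, k0:k1] * B[k0:k1, j0:j1].
function matmul_rec!(C, A, B, i0, i1, j0, j1, k0, k1, base)
    m, n, k = i1 - i0 + 1, j1 - j0 + 1, k1 - k0 + 1
    if max(m, n, k) <= base
        # Base case: simple loops.  This is where an optimized BLAS
        # would use straight-line, register-blocked SIMD code instead.
        for j in j0:j1, l in k0:k1
            @inbounds b = B[l, j]
            for i in i0:i1
                @inbounds C[i, j] += A[i, l] * b
            end
        end
    elseif m >= n && m >= k          # split rows of A/C
        mid = (i0 + i1) >> 1
        matmul_rec!(C, A, B, i0, mid,  j0, j1, k0, k1, base)
        matmul_rec!(C, A, B, mid + 1, i1, j0, j1, k0, k1, base)
    elseif n >= k                    # split columns of B/C
        mid = (j0 + j1) >> 1
        matmul_rec!(C, A, B, i0, i1, j0, mid,  k0, k1, base)
        matmul_rec!(C, A, B, i0, i1, mid + 1, j1, k0, k1, base)
    else                             # split the shared (inner) dimension
        mid = (k0 + k1) >> 1
        matmul_rec!(C, A, B, i0, i1, j0, j1, k0, mid,  base)
        matmul_rec!(C, A, B, i0, i1, j0, j1, mid + 1, k1, base)
    end
    return C
end

# Convenience wrapper over the full matrices; base = 32 is a guess.
matmul_rec!(C, A, B; base=32) =
    matmul_rec!(C, A, B, 1, size(A, 1), 1, size(B, 2), 1, size(A, 2), base)
```

Because the recursion always halves the largest dimension, every subproblem eventually fits in whatever cache level exists, without the algorithm ever knowing the cache sizes; that is the "cache-oblivious" part. For example, `matmul_rec!(zeros(100, 90), rand(100, 70), rand(70, 90))` should agree with `A * B` up to rounding.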
