On Tuesday, February 11, 2014 10:35:35 AM UTC-5, Tim Holy wrote:
>
> Jutho, you may want to check out generic_matmatmul deep inside Base; it 
> implements a cache-friendly multiplication algorithm. It's possible that 
> you'll find it's faster yet. 
>

generic_matmatmul only does one level of blocking, for a guessed L1 size, 
and is not particularly sophisticated about it.  So it's still not anywhere 
close to competitive with an optimized BLAS.
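To make "one level of blocking" concrete, here is a minimal sketch (in Python, for illustration — this is not the actual generic_matmatmul code, and the function name and tile size are my own) of tiling a matrix product so that the working set of each tile fits in a guessed L1 cache:

```python
def blocked_matmul(A, B_mat, block=64):
    """Compute C = A * B_mat using one level of square cache blocking.

    A is m x k and B_mat is k x n, given as lists of lists of floats.
    `block` is a guessed tile size chosen so three block x block tiles
    fit in L1; this is the single level of blocking being described.
    """
    m, k = len(A), len(A[0])
    n = len(B_mat[0])
    C = [[0.0] * n for _ in range(m)]
    for ii in range(0, m, block):
        for kk in range(0, k, block):
            for jj in range(0, n, block):
                # Accumulate one tile: C[ii:, jj:] += A_tile * B_tile
                for i in range(ii, min(ii + block, m)):
                    for kx in range(kk, min(kk + block, k)):
                        a = A[i][kx]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += a * B_mat[kx][j]
    return C
```

The point of the criticism above is that a single fixed tile size can only target one cache level, whereas an optimized BLAS blocks for several levels at once.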

(In my experience, you can get good usage of the L1 and L2 caches using a 
simple recursive cache-oblivious algorithm (with a large enough base case 
to amortize the recursion overhead).  The hard part, the reason that an 
optimized BLAS requires tens of thousands of lines of code, is the lowest 
level: getting optimal usage of the registers and the instruction 
pipeline.  This requires lots of optimized straight-line/unrolled code for 
different little matrix multiplies (on the order of 10x10, although the 
optimum block size for the registers is actually non-square because of the 
difference in load and store latency), combined with the necessity of using 
SIMD instructions at this level.)
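The recursive cache-oblivious scheme mentioned above can be sketched as follows (again in Python purely for illustration; the function name, index convention, and base-case threshold are assumptions, not code from Base or from any BLAS). The recursion repeatedly halves the largest of the three dimensions, so subproblems eventually fit in every cache level without knowing the cache sizes; the `base` cutoff gives a large enough leaf to amortize the recursion overhead, and in a real implementation the leaf would be the hand-tuned register/SIMD kernel discussed above:

```python
def matmul_rec(A, B, C, i0, i1, j0, j1, k0, k1, base=32):
    """Cache-oblivious C[i0:i1, j0:j1] += A[i0:i1, k0:k1] * B[k0:k1, j0:j1].

    Recursively splits the largest dimension in half; once every
    dimension is at most `base`, falls back to a plain triple loop
    (the stand-in for an optimized register-blocked kernel).
    """
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= base:
        for i in range(i0, i1):
            for kk in range(k0, k1):
                a = A[i][kk]
                for j in range(j0, j1):
                    C[i][j] += a * B[kk][j]
    elif m >= n and m >= k:
        mid = i0 + m // 2          # split rows of A and C
        matmul_rec(A, B, C, i0, mid, j0, j1, k0, k1, base)
        matmul_rec(A, B, C, mid, i1, j0, j1, k0, k1, base)
    elif n >= k:
        mid = j0 + n // 2          # split columns of B and C
        matmul_rec(A, B, C, i0, i1, j0, mid, k0, k1, base)
        matmul_rec(A, B, C, i0, i1, mid, j1, k0, k1, base)
    else:
        mid = k0 + k // 2          # split the shared dimension
        matmul_rec(A, B, C, i0, i1, j0, j1, k0, mid, base)
        matmul_rec(A, B, C, i0, i1, j0, j1, mid, k1, base)
```

Note that both k-splits write into the same block of C, which is why the routine accumulates into C rather than overwriting it.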
