@Gene: if all matrices are of N x N , and my size of L1 cache is L1, then If I need a block of A and B to fit in cache would I need it as 2 x (blocksize )^2 x 8 ( bytes per data item-for example) = L1 ?? or would I need it as 3 ( including C block ) x (blocksize)^2 x 8 = L1 ? Am confused. I tried a sample and I think I got somewhat good speedup for block size 32 ( for matrix dimension 512, 1024 etc )for my L1 of size 16 kbytes and L2 256 kbytes...Any comments or inferences?
On Wed, Feb 29, 2012 at 9:31 AM, Arun Vishwanathan <[email protected]>wrote: > @all: Thanks a lot > > > On Wed, Feb 29, 2012 at 9:02 AM, Gene <[email protected]> wrote: > >> For big matrices, using all the caches well is very important. The >> usual approach is block tiling as you say. You want two blocks to fit >> nicely in the L1 cache. The most highly optimized schemes will have a >> hierarchy of tiles where two tiles at the second level will fit in the >> L2 cache, etc. Additional levels can be based on memory interfaces, >> interleaving, page size, and even cylinder size on disk (for really >> huge matrices). The idea is _never_ to generate more cache misses than >> you need to. A miss causes a factor of 10 to 10000 performance >> multiple on that operation. >> >> Multiplying within the innermost blocks should also consider prefetch: >> you'll get best performance touching locations in contiguous ascending >> order. The inner loops are >> >> for i = 1 to ma >> for j = 1 to nb >> for k = 1 to na >> r[i,j] += a[i,k] * b[k,j] >> >> This ignores that r[i,j] needs to be zeroed out somewhere. But with >> this assumption, the loops can be juxtaposed in any order without >> changing the result. You should explore the 6 possible orderings for >> the best one. Of course you also have to zero out the sums in the >> best possible manner. >> >> FWIW, a GPU will normally outperform a general purpose CPU with ease >> on this problem. Since even cell phones are getting GPUs these days, >> tweaking single-processor matrix code is a dying art. >> >> Cheers. >> >> On Feb 27, 6:57 pm, Arun Vishwanathan <[email protected]> wrote: >> > Hi all, >> > >> > We have this challenge to make the fastest executing serial matrix >> > multiplication code. I have tried using matrix transpose( in C for row >> > major ) and loop unrolling.I was able to obtain little speedup. Does >> anyone >> > have any hints/papers that I could read upon and try to speed up >> further?I >> > had tried a bit of block tiling but was not successful. >> > >> > Thanks >> > Arun >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Algorithm Geeks" group. >> To post to this group, send email to [email protected]. >> To unsubscribe from this group, send email to >> [email protected]. >> For more options, visit this group at >> http://groups.google.com/group/algogeeks?hl=en. >> >> > > > -- > "People often say that motivation doesn't last. Well, neither does > bathing - that's why we recommend it daily." > > -- "People often say that motivation doesn't last. Well, neither does bathing - that's why we recommend it daily." -- You received this message because you are subscribed to the Google Groups "Algorithm Geeks" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/algogeeks?hl=en.
