@Gene: if all matrices are of N x N , and my size of L1 cache is L1, then
If I need a block of A and B to fit in cache would I need it as 2 x
(blocksize )^2 x 8 ( bytes per data item-for example) = L1 ?? or would I
need it as 3 ( including C block ) x (blocksize)^2 x 8 = L1 ? Am confused.
I tried a sample and I think I got somewhat good speedup for block size 32
( for matrix dimension 512, 1024 etc )for my L1 of size 16 kbytes and L2
256 kbytes...Any comments or inferences?


On Wed, Feb 29, 2012 at 9:31 AM, Arun Vishwanathan
<[email protected]>wrote:

> @all: Thanks a lot
>
>
> On Wed, Feb 29, 2012 at 9:02 AM, Gene <[email protected]> wrote:
>
>> For big matrices, using all the caches well is very important.  The
>> usual approach is block tiling as you say.  You want two blocks to fit
>> nicely in the L1 cache.  The most highly optimized schemes will have a
>> hierarchy of tiles where two tiles at the second level will fit in the
>> L2 cache, etc. Additional levels can be based on memory interfaces,
>> interleaving, page size, and even cylinder size on disk (for really
>> huge matrices). The idea is _never_ to generate more cache misses than
>> you need to. A miss causes a factor of 10 to 10000 performance
>> multiple on that operation.
>>
>> Multiplying within the innermost blocks should also consider prefetch:
>> you'll get best performance touching locations in contiguous ascending
>> order.  The inner loops are
>>
>> for i = 1 to ma
>>  for j = 1 to nb
>>    for k = 1 to na
>>      r[i,j] += a[i,k] * b[k,j]
>>
>> This ignores that r[i,j] needs to be zeroed out somewhere.  But with
>> this assumption, the loops can be juxtaposed in any order without
>> changing the result. You should explore the 6 possible orderings for
>> the best one.  Of course you also have to zero out the sums in the
>> best possible manner.
>>
>> FWIW, a GPU will normally outperform a general purpose CPU with ease
>> on this problem.  Since even cell phones are getting GPUs these days,
>> tweaking single-processor matrix code is a dying art.
>>
>> Cheers.
>>
>> On Feb 27, 6:57 pm, Arun Vishwanathan <[email protected]> wrote:
>> > Hi all,
>> >
>> > We have this challenge to make the fastest executing serial matrix
>> > multiplication code. I have tried using matrix transpose( in C for row
>> > major ) and loop unrolling.I was able to obtain little speedup. Does
>> anyone
>> > have any hints/papers that I could read upon and try to speed up
>> further?I
>> > had tried a bit of block tiling but was not successful.
>> >
>> > Thanks
>> > Arun
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Algorithm Geeks" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/algogeeks?hl=en.
>>
>>
>
>
> --
>  "People often say that motivation doesn't last. Well, neither does
> bathing - that's why we recommend it daily."
>
>


-- 
 "People often say that motivation doesn't last. Well, neither does bathing
- that's why we recommend it daily."

-- 
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/algogeeks?hl=en.

Reply via email to