Hi, 

In optimizing an inner-loop statistics computation with fairly large 
matrices (products of 1024x10000 * 10000x60 are not uncommon) I have to 
decide where to put the last few unavoidable transpositions.  While timing, I 
noticed that cutting out an intermediate result makes quite a difference, 
apparently more than doing the transposition inside the multiplication.  
Specifically, it seems that

a' * b
a' * c

runs somewhat faster and uses less memory than

at = a'
at * b
at * c
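
The comparison can be reproduced roughly as follows.  This is a minimal sketch, not the original benchmark: the random test data, the exact sizes, the variable names, and the timing via tic/toc are all assumptions for illustration.

```octave
% Sizes taken from the example above; the data itself is random (assumption).
n = 10000;
a = rand(n, 1024);   % a' is 1024 x 10000
b = rand(n, 60);
c = rand(n, 60);

% Variant 1: transpose written inline in each product.
tic;
r1 = a' * b;
r2 = a' * c;
t_inline = toc;

% Variant 2: explicit intermediate transpose, reused for both products.
tic;
at = a';
s1 = at * b;
s2 = at * c;
t_explicit = toc;

printf("inline transpose:   %g s\n", t_inline);
printf("explicit transpose: %g s\n", t_explicit);
```

Timings from a single run are noisy, so repeating each variant several times and comparing the averages is probably more reliable.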

Does anybody know whether `a' * b` is translated to a single BLAS call with the 
correct transposition options set, or to an explicit transposition followed by 
a BLAS call? 

The fact that the first option allocates less memory than the second 
suggests the former.  

Cheers, 

---david
