Hi,

In optimizing an inner-loop statistics computation with fairly large matrices (matmuls of 1024x10000 * 10000x60 are not uncommon), I have to decide where to put the last few unavoidable transpositions. In timing, I have noticed that cutting out an intermediate result makes quite a difference, apparently more than a transposition-in-multiplication. Specifically, it seems that

    a' * b
    a' * c

runs somewhat faster, and uses less memory, than

    at = a'
    at * b
    at * c

Does anybody know whether `a' * b` is translated to a single BLAS call with the correct transposition options set, or to a transposition followed by a BLAS call? The fact that the first option allocates less memory than the second suggests the former.

Cheers,

---david
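
P.S. To make the two possibilities concrete, here is a minimal sketch in pure Python (not the language of the snippet above, and not a claim about what the compiler actually does). The helper names `matmul_t`, `transpose` and `matmul` are made up for illustration. `matmul_t` computes `a' * b` by reading `a` with swapped indices, which is conceptually what the transpose flags of a BLAS GEMM call (TRANSA/TRANSB) do; the other route first materializes `at` as a separate copy.

```python
def matmul_t(a, b):
    """Compute a' * b without building a transposed copy of a.

    a is k x m and b is k x n (lists of rows); the result is m x n.
    The transposition is done purely by indexing a[p][i] instead of a[i][p].
    """
    k, m, n = len(a), len(a[0]), len(b[0])
    return [[sum(a[p][i] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def transpose(a):
    """Return a new matrix holding a' (this allocates a full copy)."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain a * b for a (m x k) and b (k x n), both lists of rows."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


# Tiny hypothetical inputs, just to show both routes agree.
a = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
b = [[1, 0], [0, 1], [1, 1]]   # 3 x 2

fused = matmul_t(a, b)         # "a' * b" in one step, no intermediate

at = transpose(a)              # "at = a'" allocates the intermediate
explicit = matmul(at, b)       # "at * b"

print(fused == explicit)
```

Both routes give the same product; the only difference in this sketch is that the second one holds an extra copy of `a`, which would be consistent with the higher memory use I am seeing.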
