It can be quite large. With
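julia> function mymul(A,B)
           m, n = size(A, 1), size(B, 2)
           # uninitialized result with the promoted element type
           C = Matrix{promote_type(eltype(A), eltype(B))}(undef, m, n)
           for j = 1:n
               for i = 1:m
                   # dot product of row i of A with column j of B
                   tmp = zero(eltype(C))
                   for k = 1:size(A, 2)
                       tmp += A[i,k]*B[k,j]
                   end
                   C[i,j] = tmp
               end
           end
           return C
       end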
I get the following single-threaded OpenBLAS speed-ups over mymul:
size    factor
   2    1.16176
   4    0.515929
   8    1.73846
  16    4.80873
  32    10.4425
  64    11.6411
 128    20.1504
 256    41.6211
 512    38.4489
1024    136.855
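
For anyone who wants to try a comparison like this themselves, here is a rough sketch of one way such factors could be measured. It is not the script used for the numbers above: `naive_mul`, `best_time` and `speedup` are names made up for illustration, the matrices are assumed to be square `Float64`, and `@elapsed` is a fairly crude timer, so results will vary by machine and Julia version.

```julia
using LinearAlgebra

# Naive triple-loop multiply, same structure as mymul above.
function naive_mul(A, B)
    m, n = size(A, 1), size(B, 2)
    C = Matrix{promote_type(eltype(A), eltype(B))}(undef, m, n)
    for j = 1:n, i = 1:m
        tmp = zero(eltype(C))
        for k = 1:size(A, 2)
            tmp += A[i, k] * B[k, j]
        end
        C[i, j] = tmp
    end
    return C
end

# Hypothetical helper: best wall time of f() over `reps` runs, in seconds.
best_time(f, reps) = minimum(@elapsed(f()) for _ in 1:reps)

# Ratio of naive time to built-in (BLAS-backed) time for n-by-n matrices.
function speedup(n, reps = 3)
    A, B = rand(n, n), rand(n, n)
    naive_mul(A, B)                  # warm up so compilation is not timed
    A * B
    return best_time(() -> naive_mul(A, B), reps) / best_time(() -> A * B, reps)
end

BLAS.set_num_threads(1)              # restrict BLAS to a single thread
for n in (16, 64, 256)
    println(n, "\t", speedup(n))
end
```

For very small sizes the individual timings are too short for `@elapsed` to resolve well, so a dedicated benchmarking package would likely give steadier numbers there.
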
2015-07-08 10:46 GMT-04:00 Josh Langsfeld:
> Ah, thanks, that's good to know. I was under the mistaken impression that
> loops are always the fastest option in Julia since it's brought up pretty
> frequently. Out of curiosity, what factor of slow-down would not using the
> optimized routines cause?
>
> On Wed, Jul 8, 2015 at 10:39 AM, Andreas Noack wrote:
>
>> You could, but unless the matrices are small, it would be slower because
>> it wouldn't use optimized matrix multiplication.
>>
>> 2015-07-08 10:36 GMT-04:00 Josh Langsfeld:
>>
>>> Maybe I'm missing something obvious, but couldn't you easily write your
>>> own 'cross' function that uses a couple nested for-loops to do the
>>> arithmetic without any intermediate allocations at all?
>>>
>>> On Tuesday, July 7, 2015 at 6:24:34 PM UTC-4, Matthieu wrote:
>>>>
>>>> Thanks, this is what I currently do :)
>>>>
>>>> However, I'd like to find a solution that is memory efficient (X can be
>>>> very large) and does not modify X in place.
>>>>
>>>> Basically, I'm wondering whether there is a BLAS subroutine that would
>>>> allow computing cross(X, w, Y) in one pass without creating an
>>>> intermediate matrix as large as X or Y.
>>>>
>>>>
>>
>
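
For reference, the nested-loop idea suggested in the quoted messages could look roughly like the sketch below. It rests on an assumption that is not confirmed anywhere in this part of the thread: that cross(X, w, Y) means X' * Diagonal(w) * Y, with w a vector of per-row weights. The name `weighted_cross` is made up for illustration. It allocates only the output matrix and leaves X and Y untouched, but as noted above it will not match the optimized BLAS routines on large inputs.

```julia
# Hypothetical helper: assumes cross(X, w, Y) means X' * Diagonal(w) * Y.
function weighted_cross(X, w, Y)
    size(X, 1) == length(w) == size(Y, 1) ||
        throw(DimensionMismatch("X, w and Y must have the same number of rows"))
    T = promote_type(eltype(X), eltype(w), eltype(Y))
    C = zeros(T, size(X, 2), size(Y, 2))
    for j = 1:size(Y, 2), i = 1:size(X, 2)
        tmp = zero(T)
        # walk down column i of X and column j of Y (contiguous in memory)
        for k = 1:length(w)
            tmp += X[k, i] * w[k] * Y[k, j]
        end
        C[i, j] = tmp
    end
    return C
end

X, w, Y = rand(5, 3), rand(5), rand(5, 2)
C = weighted_cross(X, w, Y)          # 3-by-2 result; only C is allocated
```
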