It can be quite large. With

julia> function mymul(A,B)
           m, n = size(A, 1), size(B, 2)
           # uninitialized m-by-n result with the promoted element type
           C = similar(A, promote_type(eltype(A), eltype(B)), (m, n))
           for j = 1:n
               for i = 1:m
                   tmp = zero(eltype(C))
                   for k = 1:size(A, 2)
                       tmp += A[i,k]*B[k,j]
                   end
                   C[i,j] = tmp
               end
           end
           return C
       end

I get the following single-threaded OpenBLAS speed-up factors:

size   factor
2      1.16176
4      0.515929
8      1.73846
16     4.80873
32     10.4425
64     11.6411
128    20.1504
256    41.6211
512    38.4489
1024   136.855
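
For anyone who wants to try this on their own machine, here is a rough sketch of how such factors could be measured. It is not the exact script behind the numbers above. It assumes Julia 1.x, where the BLAS wrappers live in the LinearAlgebra standard library, and the names naive_mul and best_time are hypothetical helpers made up for this example. The loop stops at 256 only to keep the run short.

```julia
using LinearAlgebra  # for BLAS.set_num_threads (assumes Julia 1.x)

# Hypothetical restatement of the triple loop so this block runs on its own.
function naive_mul(A, B)
    m, n = size(A, 1), size(B, 2)
    C = similar(A, promote_type(eltype(A), eltype(B)), (m, n))
    for j in 1:n, i in 1:m
        tmp = zero(eltype(C))
        for k in 1:size(A, 2)
            tmp += A[i, k] * B[k, j]
        end
        C[i, j] = tmp
    end
    return C
end

# Best-of-`reps` wall time in seconds. The first call is a warm-up,
# so compilation time is not included in the measurement.
function best_time(f, A, B; reps = 5)
    f(A, B)
    return minimum([@elapsed(f(A, B)) for _ in 1:reps])
end

BLAS.set_num_threads(1)  # restrict BLAS to a single thread

for p in 1:8
    n = 2^p
    A, B = rand(n, n), rand(n, n)
    # factor > 1 means the built-in `*` is faster than the loops
    factor = best_time(naive_mul, A, B) / best_time(*, A, B)
    println(n, "\t", factor)
end
```

The factors will vary with hardware, BLAS build, and Julia version, so expect the same general trend rather than the same values.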

2015-07-08 10:46 GMT-04:00 Josh Langsfeld <[email protected]>:

> Ah, thanks, that's good to know. I was under the mistaken impression that
> loops are always the fastest option in Julia since it's brought up pretty
> frequently. Out of curiosity, what factor of slow-down would not using the
> optimized routines cause?
>
> On Wed, Jul 8, 2015 at 10:39 AM, Andreas Noack <
> [email protected]> wrote:
>
>> You could, but unless the matrices are small, it would be slower because
>> it wouldn't use optimized matrix multiplication.
>>
>> 2015-07-08 10:36 GMT-04:00 Josh Langsfeld <[email protected]>:
>>
>>> Maybe I'm missing something obvious, but couldn't you easily write your
>>> own 'cross' function that uses a couple nested for-loops to do the
>>> arithmetic without any intermediate allocations at all?
>>>
>>> On Tuesday, July 7, 2015 at 6:24:34 PM UTC-4, Matthieu wrote:
>>>>
>>>> Thanks, this is what I currently do :)
>>>>
>>>> However, I'd like to find a solution that is both memory efficient (X
>>>> can be very large) and which does not modify X in place.
>>>>
>>>> Basically, I'm wondering whether there is a BLAS subroutine that would
>>>> compute cross(X, w, Y) in one pass without creating an
>>>> intermediate matrix as large as X or Y.
>>>>
>>>>
>>
>
