Well, maybe the subject of the post is confusing. I've tried to write an 
algorithm that runs approximately as fast as the BLAS functions, but in a 
pure Julia implementation. Sure, we know that BLAS is highly optimized; I 
didn't want to beat BLAS, just to be a bit slower, let us say ~1.2-times. 

If I take a part of the algorithm and run it separately, all works fine. 
Consider the code below:
# Pure-Julia rank-1 update: A += 1.1 * x * y'
function rank1update!(A, x, y)
    for j = 1:size(A, 2)
        @fastmath @inbounds @simd for i = 1:size(A, 1)
            A[i,j] += 1.1 * y[j] * x[i]
        end
    end
end

# The same update via BLAS: ger! computes A := 1.1 * x * y' + A in place
function rank1updateb!(A, x, y)
    BLAS.ger!(1.1, x, y, A)
end

Here BLAS is ~1.2-times faster.
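
For completeness, the comparison can be timed roughly like this. This is 
only a minimal sketch using the functions above and the BenchmarkTools 
package, with 1000 x 1000 matrices as mentioned later in the thread; the 
exact numbers depend on the setup:

using LinearAlgebra   # provides BLAS on current Julia versions
using BenchmarkTools

A = rand(1000, 1000); x = rand(1000); y = rand(1000)

# copy A in setup so each sample starts from the same matrix
@btime rank1update!(B, $x, $y)  setup=(B = copy($A)) evals=1   # pure Julia
@btime rank1updateb!(B, $x, $y) setup=(B = copy($A)) evals=1   # BLAS.ger!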
However, when it is called together with 'mygemv!' in the loop (see the 
code in the original post), the performance drops to ~2.6-times slower 
than using the BLAS functions (gemv, ger).
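
('mygemv!' itself is in the original post and not reproduced here; the 
sketch below is only an illustrative stand-in of the same loop style, not 
the actual code, and the outer loop is likewise hypothetical. It shows the 
kind of combination in which the pure-Julia version falls behind BLAS.)

# Illustrative stand-in only -- a plain column-oriented r := A*x kernel
function mygemv_sketch!(r, A, x)
    fill!(r, 0.0)
    for j = 1:size(A, 2)
        @fastmath @inbounds @simd for i = 1:size(A, 1)
            r[i] += A[i,j] * x[j]
        end
    end
    return r
end

# Hypothetical outer loop alternating the two kernels
function combined!(A, x, y, r, iters)
    for k = 1:iters
        mygemv_sketch!(r, A, x)
        rank1update!(A, r, y)
    end
end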




On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>
> I'm not sure what the expected result here is. BLAS is designed to be as 
> fast as possible at matrix multiply. I'd be more concerned if you write 
> straightforward loop code and beat BLAS, since that means the BLAS is badly 
> mistuned.
>
> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <[email protected]> wrote:
>
>> Thanks Steven, I thought there was something more behind it...
>>
>> I shall note that I forgot to mention the matrix dimensions, which are 
>> 1000 x 1000.
>>
>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>>
>>> You need a lot more than just fast loops to match the performance of an 
>>> optimized BLAS.    See e.g. this notebook for some comments on the related 
>>> case of matrix multiplication:
>>>
>>>
>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>>
>>
>
