Maybe the subject of the post is confusing. I've tried to write an
algorithm in pure Julia that runs approximately as fast as the
equivalent BLAS calls. Of course BLAS is highly optimized; I didn't
want to beat BLAS, just to be a bit slower, let's say ~1.2-times.
If I take a part of the algorithm and run it separately, all works
fine. Consider the code below:
function rank1update!(A, x, y)
    # pure-Julia rank-1 update: A += 1.1 * x * y'
    for j = 1:size(A, 2)
        @fastmath @inbounds @simd for i = 1:size(A, 1)
            A[i,j] += 1.1 * y[j] * x[i]
        end
    end
end

function rank1updateb!(A, x, y)
    # same update through BLAS (BLAS.ger! computes A += alpha * x * y');
    # the BLAS module lives in Base.LinAlg here, LinearAlgebra on newer Julia
    BLAS.ger!(1.1, x, y, A)
end
Here BLAS is only ~1.2-times faster.
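For reference, this is roughly how I measure the two versions (a
minimal sketch using BenchmarkTools on 1000 x 1000 inputs, matching
the dimensions mentioned below; the variable names are just for
illustration):

using BenchmarkTools
A = rand(1000, 1000); x = rand(1000); y = rand(1000)
@btime rank1update!(B, $x, $y)  setup=(B = copy($A)) evals=1
@btime rank1updateb!(B, $x, $y) setup=(B = copy($A)) evals=1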
However, when I call it together with 'mygemv!' in a loop (see the
code in the original post), performance drops to ~2.6-times slower
than using the BLAS functions (gemv, ger).
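The 'mygemv!' code itself is in the original post and not quoted here;
roughly it is a plain-loop matrix-vector kernel written in the same
style as rank1update! above, something along these lines (my sketch,
not the exact original):

function mygemv!(y, A, x)
    # illustrative pure-Julia y += A * x, same macro style as above;
    # not the exact code from the original post
    for j = 1:size(A, 2)
        @fastmath @inbounds @simd for i = 1:size(A, 1)
            y[i] += A[i,j] * x[j]
        end
    end
end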
On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>
> I'm not sure what the expected result here is. BLAS is designed to be as
> fast as possible at matrix multiply. I'd be more concerned if you write
> straightforward loop code and beat BLAS, since that means the BLAS is badly
> mistuned.
>
> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <[email protected]>
> wrote:
>
>> Thanks Steven, I thought there was something more behind it...
>>
>> I should note that I forgot to mention the matrix dimensions, which
>> are 1000 x 1000.
>>
>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>>
>>> You need a lot more than just fast loops to match the performance of an
>>> optimized BLAS. See e.g. this notebook for some comments on the related
>>> case of matrix multiplication:
>>>
>>>
>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>>
>>
>