The architecture-specific, manual BLAS optimizations don't just give
you an additional 20%. They can improve speed by a factor of a few.

If you see a factor of 2.6, then that's probably to be accepted, unless
you really dig into the details (inspect the generated assembly, measure
cache misses, introduce manual vectorization and loop unrolling, etc.).
And you'll have to repeat that analysis if you're using a different
system.
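
As an illustration, here is a minimal sketch of that kind of inspection
in Julia (the kernel below is a generic stand-in, not the code from the
original post; cache misses would need an external tool such as Linux
perf):

# toy kernel to inspect; @inbounds/@simd let LLVM try to vectorize the loop
function saxpy_like!(y, a, x)
    @inbounds @simd for i = 1:length(x)
        y[i] += a * x[i]
    end
    return y
end

x = rand(1000); y = rand(1000)

@code_llvm saxpy_like!(y, 1.1, x)     # was the loop actually vectorized?
@code_native saxpy_like!(y, 1.1, x)   # inspect the generated assembly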

-erik

On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
<[email protected]> wrote:
> Well, maybe the subject of the post is confusing. I've tried to write an
> algorithm which runs approximately as fast as the BLAS functions, but
> using a pure Julia implementation. Sure, we know that BLAS is highly
> optimized; I didn't want to beat BLAS, just to be a bit slower, let us say
> ~1.2-times.
>
> If I take a part of the algorithm and run it separately, all works fine.
> Consider the code below:
> # Pure-Julia rank-1 update: A += 1.1 * x * y'
> function rank1update!(A, x, y)
>     for j = 1:size(A, 2)
>         @fastmath @inbounds @simd for i = 1:size(A, 1)
>             A[i,j] += 1.1 * y[j] * x[i]
>         end
>     end
> end
>
> # Same rank-1 update via BLAS: ger!(alpha, x, y, A) does A += alpha * x * y'
> function rank1updateb!(A, x, y)
>     R = BLAS.ger!(1.1, x, y, A)
> end
>
> Here BLAS is ~1.2-times faster.
> However, calling it together with 'mygemv!' in the loop (see the code in
> the original post), the performance drops to ~2.6 times slower than using
> the BLAS functions (gemv, ger).
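>
> For reference, a minimal timing sketch of that isolated comparison (it
> assumes the 1000 x 1000 size mentioned further down in the thread; the
> loop calling 'mygemv!' from the original post is not reproduced here):
>
> A = rand(1000, 1000); x = rand(1000); y = rand(1000)
>
> rank1update!(copy(A), x, y)      # warm-up calls so compilation isn't timed
> BLAS.ger!(1.1, x, y, copy(A))
>
> @time for k = 1:100; rank1update!(A, x, y); end    # pure-Julia loops
> @time for k = 1:100; BLAS.ger!(1.1, x, y, A); end  # BLAS ger!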
>
> On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>>
>> I'm not sure what the expected result here is. BLAS is designed to be as
>> fast as possible at matrix multiply. I'd be more concerned if you write
>> straightforward loop code and beat BLAS, since that means the BLAS is badly
>> mistuned.
>>
>> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <[email protected]>
>> wrote:
>>>
>>> Thanks Steven, I thought there was something more behind it...
>>>
>>> I should note that I forgot to mention the matrix dimensions, which are
>>> 1000 x 1000.
>>>
>>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>>>
>>>> You need a lot more than just fast loops to match the performance of an
>>>> optimized BLAS.    See e.g. this notebook for some comments on the related
>>>> case of matrix multiplication:
>>>>
>>>>
>>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>
>>
>

-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
