I mentioned the ~20% factor only because it is something I've commonly 
observed; of course it can vary, and it isn't that important.

What bothers me is: why does the performance drop by a factor of ~2 when I 
combine two routines, while each one alone causes a drop of only ~20%? 
In other words, I have routines foo() and bar() and their BLAS equivalents 
fooblas() and barblas(), where 
*@elapsed foo() / @elapsed fooblas() ~= 1.2 *
The same holds for bar(). Consider the following pseudo-code:
function foobar()
  for k in 1:N
    foo()  # my Julia implementation of a BLAS function, for example gemv
    bar()  # my Julia implementation of a BLAS function, for example ger
  end
end


function foobarblas()
  for k in 1:N
    fooblas()  # the BLAS equivalent of foo, for example gemv
    barblas()  # the BLAS equivalent of bar, for example ger
  end
end
then *@elapsed foobar() / @elapsed foobarblas() ~= 2.6*
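
For reference, here is a minimal, self-contained sketch of the kind of 
comparison I mean, assuming 1000 x 1000 matrices and simple gemv/ger-style 
kernels as stand-ins for foo()/bar(); the names mygemv!/myger! and the 
timing harness are illustrative, not the exact code from my original post 
(on current Julia versions, 'using LinearAlgebra' brings BLAS into scope):

using LinearAlgebra   # on current Julia versions this brings BLAS into scope

function mygemv!(v, A, x)            # pure-Julia stand-in for foo(): v = A*x
    fill!(v, 0.0)
    for j in 1:size(A, 2)
        @inbounds @simd for i in 1:size(A, 1)
            v[i] += A[i, j] * x[j]
        end
    end
    return v
end

function myger!(A, x, y)             # pure-Julia stand-in for bar(): A += 1.1*x*y'
    for j in 1:size(A, 2)
        @inbounds @simd for i in 1:size(A, 1)
            A[i, j] += 1.1 * y[j] * x[i]
        end
    end
    return A
end

function foobar!(A, x, y, v, N)      # combined pure-Julia loop
    for k in 1:N
        mygemv!(v, A, x)
        myger!(A, x, y)
    end
end

function foobarblas!(A, x, y, v, N)  # the same loop going through BLAS
    for k in 1:N
        BLAS.gemv!('N', 1.0, A, x, 0.0, v)
        BLAS.ger!(1.1, x, y, A)
    end
end

n, N = 1000, 20
A = rand(n, n); x = rand(n); y = rand(n); v = zeros(n)
foobar!(copy(A), x, y, v, 1); foobarblas!(copy(A), x, y, v, 1)   # warm up / compile
println("ratio: ", @elapsed(foobar!(copy(A), x, y, v, N)) /
                   @elapsed(foobarblas!(copy(A), x, y, v, N)))

Passing the arrays as arguments keeps non-constant globals out of the timed 
loops, so the measured ratio reflects the kernels themselves rather than 
global-variable overhead.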


On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>
> The architecture-specific, manual BLAS optimizations don't just give 
> you an additional 20%. They can improve speed by a factor of a few. 
>
> If you see a factor of 2.6, then that's probably to be accepted, 
> unless you really look into the details (generated assembler code, 
> measure cache misses, introduce manual vectorization and loop 
> unrolling, etc.). And you'll have to repeat that analysis if you're 
> using a different system. 
>
> -erik 
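
For what it's worth, here is a minimal sketch of where such an inspection 
could start, using Julia's standard reflection macros on the rank1update! 
kernel quoted further below; the 1000 x 1000 sizes come from this thread, 
and the perf invocation is a Linux-specific assumption:

function rank1update!(A, x, y)
    for j = 1:size(A, 2)
        @fastmath @inbounds @simd for i = 1:size(A, 1)
            A[i,j] += 1.1 * y[j] * x[i]
        end
    end
end

A = rand(1000, 1000); x = rand(1000); y = rand(1000)

# Does the inner loop vectorize?  Look for packed SIMD instructions
# (e.g. vmulpd / vfmadd on x86) in the generated native code.
@code_native rank1update!(A, x, y)

# The LLVM IR is often easier to scan for <4 x double> vector types.
@code_llvm rank1update!(A, x, y)

# Cache misses need an external profiler, e.g. on Linux:
#   perf stat -e cache-misses,cache-references julia myscript.jl
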
>
> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky 
> <[email protected]> wrote: 
> > Well, maybe the subject of the post is confusing. I've tried to write an 
> > algorithm which runs approximately as fast as using BLAS functions, but 
> > in a pure Julia implementation. Sure, we know that BLAS is highly 
> > optimized; I didn't want to beat BLAS, just to be a bit slower, let us 
> > say ~1.2-times. 
> > 
> > If I take a part of the algorithm and run it separately, all works fine. 
> > Consider the code below: 
> > function rank1update!(A, x, y) 
> >     for j = 1:size(A, 2) 
> >         @fastmath @inbounds @simd for i = 1:size(A, 1) 
> >             A[i,j] += 1.1 * y[j] * x[i] 
> >         end 
> >     end 
> > end 
> > 
> > function rank1updateb!(A, x, y) 
> >     R = BLAS.ger!(1.1, x, y, A) 
> > end 
> > 
> > Here BLAS is ~1.2-times faster. 
> > However, calling it together with 'mygemv!' in the loop (see code in the 
> > original post), the performance drops to ~2.6 times slower than using BLAS 
> > functions (gemv, ger). 
> > 
> > 
> > 
> > 
> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote: 
> >> 
> >> I'm not sure what the expected result here is. BLAS is designed to be 
> >> as fast as possible at matrix multiply. I'd be more concerned if you write 
> >> straightforward loop code and beat BLAS, since that means the BLAS is 
> >> badly mistuned. 
> >> 
> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <[email protected]> 
> >> wrote: 
> >>> 
> >>> Thanks Steven, I thought there was something more behind it... 
> >>> 
> >>> I should note that I forgot to mention the matrix dimensions, which are 
> >>> 1000 x 1000. 
> >>> 
> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote: 
> >>>> 
> >>>> You need a lot more than just fast loops to match the performance of 
> >>>> an optimized BLAS. See e.g. this notebook for some comments on the 
> >>>> related case of matrix multiplication: 
> >>>> 
> >>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>  
> >> 
> >> 
> > 
>
>
>
> -- 
> Erik Schnetter <[email protected]> 
> http://www.perimeterinstitute.ca/personal/eschnetter/ 
>
