I've attached two notebooks; you can check the comparisons there. The first one compares the rank1update! and rank1updateb! functions. The Julia-to-BLAS ratio is 1.13, which is nice. The same applies to mygemv! vs BLAS.gemv!. But combining the same routines into the mgs algorithm from the very first post, the resulting ratio mgs / mgs_blas is 2.6 on *my computer* (an i7-6700HQ; that is important to mention, because on older processors the difference is not that big, it is similar to comparing the routines rank1update! and BLAS.ger! alone). Why this happens is what I'm trying to figure out.
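To be concrete about how I measure the ratios, below is a rough sketch of the timing driver. The actual code is in the attached notebooks; the sketch assumes the kernels rank1update!, mgs and mgs_blas from the earlier posts are already defined, and their exact signatures there may differ slightly.

# Sketch of the timing driver, assuming rank1update!, mgs and mgs_blas are defined.
# BLAS is available directly in the notebook (Julia 0.4); on newer Julia add
# `using LinearAlgebra` first, which exports the BLAS submodule.

n = 1000                          # matrix dimension used in the thread
A = rand(n, n); B = copy(A)
x = rand(n);    y = rand(n)

# warm up once so JIT compilation is not included in the timings
rank1update!(copy(A), x, y); BLAS.ger!(1.1, x, y, copy(A))

# kernel-level comparison: pure-Julia rank-1 update vs. BLAS.ger!
t_julia = @elapsed rank1update!(A, x, y)
t_blas  = @elapsed BLAS.ger!(1.1, x, y, B)
println("rank1update! / BLAS.ger!: ", t_julia / t_blas)    # ~1.1-1.2 here

# algorithm-level comparison: pure-Julia MGS vs. the BLAS-based version
M = rand(n, n)
mgs(copy(M)); mgs_blas(copy(M))                            # warm up
t_mgs      = @elapsed mgs(copy(M))
t_mgs_blas = @elapsed mgs_blas(copy(M))
println("mgs / mgs_blas: ", t_mgs / t_mgs_blas)            # ~2.6 on the i7-6700HQ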
On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>
> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky <[email protected]> wrote:
> > The factor ~20% I've mentioned just because it is something I've
> > commonly observed, and of course can vary, and isn't that important.
> >
> > What bothers me is: why does the performance drop 2-times when I combine two
> > routines where each one alone causes a performance drop of 0.2-times?
>
> I looked at the IJulia notebook you posted, but it wasn't obvious
> which routines you mean. Can you point to exactly the source codes you
> are comparing?
>
> -erik
>
> > In other words I have routines foo() and bar() and their equivalents in BLAS,
> > fooblas() and barblas(), where
> > @elapsed foo() / @elapsed fooblas() ~= 1.2
> > The same for bar. Consider the following pseudo-code
> >
> > function foobar()
> >     for k in 1:N
> >         foo()  # my Julia implementation of a BLAS function, for example gemv
> >         bar()  # my Julia implementation of a BLAS function, for example ger
> >     end
> > end
> >
> > function foobarblas()
> >     for k in 1:N
> >         fooblas()  # this is the equivalent of foo in BLAS, for example gemv
> >         barblas()  # this is the equivalent of bar in BLAS, for example ger
> >     end
> > end
> >
> > then @elapsed foobar() / @elapsed foobarblas() ~= 2.6
> >
> >
> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
> >>
> >> The architecture-specific, manual BLAS optimizations don't just give
> >> you an additional 20%. They can improve speed by a factor of a few.
> >>
> >> If you see a factor of 2.6, then that's probably to be accepted,
> >> unless you really look into the details (generated assembler code,
> >> measure cache misses, introduce manual vectorization and loop
> >> unrolling, etc.) And you'll have to repeat that analysis if you're
> >> using a different system.
> >>
> >> -erik
> >>
> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky <[email protected]> wrote:
> >> > Well, maybe the subject of the post is confusing. I've tried to write an
> >> > algorithm which runs approximately as fast as using BLAS functions, but
> >> > using a pure Julia implementation. Sure, we know that BLAS is highly
> >> > optimized; I didn't want to beat BLAS, just to be a bit slower, let us say
> >> > ~1.2-times.
> >> >
> >> > If I take a part of the algorithm and run it separately, all works fine.
> >> > Consider the code below:
> >> >
> >> > function rank1update!(A, x, y)
> >> >     for j = 1:size(A, 2)
> >> >         @fastmath @inbounds @simd for i = 1:size(A, 1)
> >> >             A[i,j] += 1.1 * y[j] * x[i]
> >> >         end
> >> >     end
> >> > end
> >> >
> >> > function rank1updateb!(A, x, y)
> >> >     R = BLAS.ger!(1.1, x, y, A)
> >> > end
> >> >
> >> > Here BLAS is ~1.2-times faster.
> >> > However, calling it together with 'mygemv!' in the loop (see code in the
> >> > original post), the performance drops to ~2.6 times slower than using the
> >> > BLAS functions (gemv, ger).
> >> >
> >> >
> >> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
> >> >>
> >> >> I'm not sure what the expected result here is. BLAS is designed to be as
> >> >> fast as possible at matrix multiply. I'd be more concerned if you write
> >> >> straightforward loop code and beat BLAS, since that means the BLAS is
> >> >> badly mistuned.
> >> >>
> >> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky <[email protected]> wrote:
> >> >>>
> >> >>> Thanks Steven, I've thought there is something more behind...
> >> >>>
> >> >>> I shall note that I forgot to mention the matrix dimensions, which are
> >> >>> 1000 x 1000.
> >> >>>
> >> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
> >> >>>>
> >> >>>> You need a lot more than just fast loops to match the performance of an
> >> >>>> optimized BLAS. See e.g. this notebook for some comments on the related
> >> >>>> case of matrix multiplication:
> >> >>>>
> >> >>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
> >>
> >> --
> >> Erik Schnetter <[email protected]>
> >> http://www.perimeterinstitute.ca/personal/eschnetter/
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
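PS, for anyone skimming the quoted thread: the combined case I keep referring to has roughly the shape below. This is only a structural sketch of a modified Gram-Schmidt step, not the exact mgs code from the notebook, and it is written for current Julia for clarity; the point is that each step pairs one gemv-like projection (mygemv! / BLAS.gemv! in the real code) with one rank-1 update (rank1update! / BLAS.ger!).

using LinearAlgebra   # for norm; not needed in the Julia 0.4 notebook

# Structural sketch only: overwrites A with Q and returns R.
function mgs_sketch!(A)
    m, n = size(A)
    R = zeros(eltype(A), n, n)
    for k = 1:n
        R[k, k] = norm(view(A, :, k))      # normalize the k-th column
        A[:, k] ./= R[k, k]
        k == n && break
        q    = view(A, :, k)
        rest = view(A, :, k+1:n)
        r = rest' * q                      # the gemv-like step (mygemv! / BLAS.gemv!)
        R[k, k+1:n] = r
        rest .-= q .* r'                   # the rank-1 update step (rank1update! / BLAS.ger!)
    end
    return R
end

For example, M = rand(1000, 1000); R = mgs_sketch!(M) overwrites M with Q and returns the upper-triangular R; the two marked lines are where the pure-Julia kernels or their BLAS counterparts get called in the notebook versions.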
Rank1update-JuliaUsers-Question.ipynb
MGS-Julia-Benchmark.ipynb
