I was using a "CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz". -erik
On Wed, Mar 23, 2016 at 10:53 AM, Igor Cerovsky <[email protected]> wrote:
> Thanks, Erik. I thought there was something deeper going on in LLVM.
> Since I'm quite new to Julia, I'll follow your suggestions and send you
> some outputs.
> Which processor were you running the benchmarks on?
>
> On 23 March 2016 at 15:42, Erik Schnetter <[email protected]> wrote:
>>
>> I get a time ratio (bc / bb) of 1.1.
>>
>> It could be that you're just having bad luck with the particular
>> optimization decisions that LLVM makes for the combined code, or with
>> the parameters (sizes) for this benchmark. Maybe the performance
>> difference changes for different matrix sizes? There are a million
>> things you can try, e.g. starting Julia with the "-O" option, or using
>> a different LLVM version. What would really help is gathering more
>> detailed information, e.g. by looking at the disassembled loop kernels
>> (to see whether something is wrong), by using a profiler to see where
>> the time is spent (Julia has a built-in profiler), or by gathering
>> statistics about floating-point instructions executed and cache
>> operations (that requires an external tool).
>>
>> The disassembled code is CPU-specific and also depends on the LLVM
>> version. I'd be happy to have a quick glance at it if you create a
>> listing (with `@code_native`) and e.g. put it up as a gist
>> <gist.github.com>. I'd also need your CPU type (`versioninfo()` in
>> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though.
>>
>> -erik
>>
>> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky
>> <[email protected]> wrote:
>>> I've attached two notebooks; you can check the comparisons.
>>> The first one compares the rank1update! and rank1updateb! functions.
>>> The Julia-to-BLAS comparison gives a ratio of 1.13, which is nice. The
>>> same applies to mygemv! vs. BLAS.gemv.
>>> Combining the same routines into the mgs algorithm from the very first
>>> post, the resulting performance ratio mgs / mgs_blas is 2.6 on my
>>> computer with an i7-6700HQ (important to mention, because on older
>>> processors the difference is not that big; it is similar to comparing
>>> the routines rank1update! and BLAS.ger!). This is what I'm trying to
>>> figure out: why?
>>>
>>> On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>>>>
>>>> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
>>>> <[email protected]> wrote:
>>>>> I mentioned the factor of ~20% just because it is something I have
>>>>> commonly observed; of course it can vary, and it isn't that important.
>>>>>
>>>>> What bothers me is: why does the performance drop 2-fold when I
>>>>> combine two routines, when each one alone causes only a ~20% (1.2x)
>>>>> slowdown?
>>>>
>>>> I looked at the IJulia notebook you posted, but it wasn't obvious
>>>> which routines you mean. Can you point to exactly the source code you
>>>> are comparing?
>>>>
>>>> -erik
>>>>
>>>>> In other words, I have routines foo() and bar() and their equivalents
>>>>> in BLAS, fooblas() and barblas(), where
>>>>>
>>>>>     @elapsed foo() / @elapsed fooblas() ~= 1.2
>>>>>
>>>>> and the same holds for bar. Consider the following pseudo-code:
>>>>>
>>>>>     function foobar()
>>>>>         for k in 1:N
>>>>>             foo()   # my Julia implementation of a BLAS function, e.g. gemv
>>>>>             bar()   # my Julia implementation of a BLAS function, e.g. ger
>>>>>         end
>>>>>     end
>>>>>
>>>>>     function foobarblas()
>>>>>         for k in 1:N
>>>>>             fooblas()   # the BLAS equivalent of foo, e.g. gemv
>>>>>             barblas()   # the BLAS equivalent of bar, e.g. ger
>>>>>         end
>>>>>     end
>>>>>
>>>>> Then @elapsed foobar() / @elapsed foobarblas() ~= 2.6.
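For reference, the comparison sketched in the pseudo-code above could be made concrete roughly as follows. This is only a minimal sketch written against current Julia (the BLAS wrappers live in the LinearAlgebra stdlib; in older Julia versions they were under Base.LinAlg), and mygemv! / myger! are simple stand-ins for the routines from the original post, which are not reproduced in this excerpt:

    using LinearAlgebra: BLAS

    # Stand-in for foo(): y <- A*x + y, a plain-Julia gemv-like kernel
    function mygemv!(y, A, x)
        for j = 1:size(A, 2)
            @fastmath @inbounds @simd for i = 1:size(A, 1)
                y[i] += A[i, j] * x[j]
            end
        end
    end

    # Stand-in for bar(): A <- A + x*y', a plain-Julia ger-like rank-1 update
    function myger!(A, x, y)
        for j = 1:size(A, 2)
            @fastmath @inbounds @simd for i = 1:size(A, 1)
                A[i, j] += x[i] * y[j]
            end
        end
    end

    function foobar!(A, x, y, N)            # pure-Julia combined loop
        for k = 1:N
            mygemv!(y, A, x)
            myger!(A, x, y)
        end
    end

    function foobarblas!(A, x, y, N)        # the same loop built from BLAS calls
        for k = 1:N
            BLAS.gemv!('N', 1.0, A, x, 1.0, y)   # y <- A*x + y
            BLAS.ger!(1.0, x, y, A)              # A <- A + x*y'
        end
    end

    n, N = 1000, 10                 # N kept small: values grow fast, this is only a timing sketch
    A, x, y = rand(n, n), rand(n), rand(n)
    Ab, xb, yb = copy(A), copy(x), copy(y)

    foobar!(copy(A), copy(x), copy(y), 1)        # warm up so compilation is not timed
    foobarblas!(copy(A), copy(x), copy(y), 1)

    t_julia = @elapsed foobar!(A, x, y, N)
    t_blas  = @elapsed foobarblas!(Ab, xb, yb, N)
    println("pure Julia / BLAS time ratio: ", t_julia / t_blas)

A single @elapsed measurement is noisy; a package such as BenchmarkTools.jl would give steadier numbers, but the structure of the comparison is the same.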
>>>>>
>>>>> On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>>>>>>
>>>>>> The architecture-specific, manual BLAS optimizations don't just give
>>>>>> you an additional 20%. They can improve speed by a factor of a few.
>>>>>>
>>>>>> If you see a factor of 2.6, then that probably has to be accepted,
>>>>>> unless you really look into the details (generated assembler code,
>>>>>> measured cache misses, manual vectorization and loop unrolling,
>>>>>> etc.). And you'll have to repeat that analysis if you're using a
>>>>>> different system.
>>>>>>
>>>>>> -erik
>>>>>>
>>>>>> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>>>>>> <[email protected]> wrote:
>>>>>>> Well, maybe the subject of the post is confusing. I've tried to
>>>>>>> write an algorithm that runs approximately as fast as one using BLAS
>>>>>>> functions, but in a pure Julia implementation. Sure, we know that
>>>>>>> BLAS is highly optimized; I didn't want to beat BLAS, just to be a
>>>>>>> bit slower, let us say ~1.2-times.
>>>>>>>
>>>>>>> If I take a part of the algorithm and run it separately, all works
>>>>>>> fine. Consider the code below:
>>>>>>>
>>>>>>>     function rank1update!(A, x, y)
>>>>>>>         for j = 1:size(A, 2)
>>>>>>>             @fastmath @inbounds @simd for i = 1:size(A, 1)
>>>>>>>                 A[i,j] += 1.1 * y[j] * x[i]
>>>>>>>             end
>>>>>>>         end
>>>>>>>     end
>>>>>>>
>>>>>>>     function rank1updateb!(A, x, y)
>>>>>>>         R = BLAS.ger!(1.1, x, y, A)
>>>>>>>     end
>>>>>>>
>>>>>>> Here BLAS is ~1.2-times faster.
>>>>>>> However, when I call it together with mygemv! in the loop (see the
>>>>>>> code in the original post), the performance drops to ~2.6 times
>>>>>>> slower than using the BLAS functions (gemv, ger).
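The diagnostics Erik suggests earlier in the thread (a disassembly listing via @code_native, plus the built-in profiler) could be collected roughly like this. A sketch only, assuming the rank1update! definition quoted above has already been evaluated; it is written against current Julia, where @code_native and versioninfo live in InteractiveUtils and the sampling profiler in the Profile stdlib (both were part of Base at the time of this thread):

    using InteractiveUtils          # @code_native, versioninfo
    using Profile                   # built-in sampling profiler

    A, x, y = rand(1000, 1000), rand(1000), rand(1000)

    versioninfo()                   # CPU / LLVM / BLAS details Erik asked for

    # Disassembled loop kernel, suitable for pasting into a gist:
    @code_native rank1update!(A, x, y)

    # Profile a longer run to see where the time is actually spent:
    rank1update!(A, x, y)           # warm up so compilation is not profiled
    @profile for k = 1:1_000
        rank1update!(A, x, y)
    end
    Profile.print()                 # per-line sample counts, tree form

The same pattern applies to the full mgs routine from the original post (not reproduced in this excerpt) once it is in scope.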
>>>>>>>
>>>>>>> On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>>>>>>>>
>>>>>>>> I'm not sure what the expected result here is. BLAS is designed to
>>>>>>>> be as fast as possible at matrix multiplication. I'd be more
>>>>>>>> concerned if you could write straightforward loop code and beat
>>>>>>>> BLAS, since that would mean the BLAS is badly mistuned.
>>>>>>>>
>>>>>>>> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Thanks, Steven. I thought there was something more behind it...
>>>>>>>>>
>>>>>>>>> I should note that I forgot to mention the matrix dimensions,
>>>>>>>>> which are 1000 x 1000.
>>>>>>>>>
>>>>>>>>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote:
>>>>>>>>>>
>>>>>>>>>> You need a lot more than just fast loops to match the performance
>>>>>>>>>> of an optimized BLAS. See e.g. this notebook for some comments on
>>>>>>>>>> the related case of matrix multiplication:
>>>>>>>>>>
>>>>>>>>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
>>>>>>
>>>>>> --
>>>>>> Erik Schnetter <[email protected]>
>>>>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>>>
>>>> --
>>>> Erik Schnetter <[email protected]>
>>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>> --
>> Erik Schnetter <[email protected]>
>> http://www.perimeterinstitute.ca/personal/eschnetter/

--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
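To get a feel for the gap Steven is describing, one can time a naive triple-loop multiply against Julia's BLAS-backed `*`. This is only a rough illustrative sketch (not taken from the linked notebook), and the measured factor will vary with the machine and BLAS build:

    # Naive three-loop matrix multiply: C <- A*B
    function naive_matmul!(C, A, B)
        fill!(C, 0.0)
        @inbounds for j = 1:size(B, 2), k = 1:size(A, 2), i = 1:size(A, 1)
            C[i, j] += A[i, k] * B[k, j]
        end
        return C
    end

    n = 1000
    A, B = rand(n, n), rand(n, n)
    C = similar(A)

    naive_matmul!(C, A, B); A * B            # warm up / compile both paths
    t_naive = @elapsed naive_matmul!(C, A, B)
    t_blas  = @elapsed(A * B)                # dispatches to BLAS gemm
    println("naive loops / BLAS gemm time ratio: ", t_naive / t_blas)

On most machines this ratio runs far beyond the 2.6 discussed above, which is Steven's point: a tuned BLAS wins by much more than clean loops alone can recover.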
