The assembler code is attached.
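
(For anyone who wants to reproduce such listings: they can be generated with Julia's built-in @code_native macro, roughly along these lines. The mgs!/mgs_blas! calls below are only placeholders for the functions from the first post; the actual signatures may differ.)

A = rand(1000, 1000)       # 1000 x 1000, the size used in the benchmarks
@code_native mgs!(A)       # assembly for the pure-Julia version (cf. code_native_mgs.asm)
@code_native mgs_blas!(A)  # assembly for the BLAS-based version (cf. code_native_mgs_blas.asm)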

versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Igor

On Wednesday, 23 March 2016 16:05:20 UTC+1, Erik Schnetter wrote:
>
> I was using a "CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz". 
>
> -erik 
>
> On Wed, Mar 23, 2016 at 10:53 AM, Igor Cerovsky 
> > <[email protected]> wrote: 
> > Thanks, Erik. I thought there was something deeper in LLVM. 
> > Since I'm quite new to Julia, I'll follow your suggestions and send you 
> > some outputs. 
> > What processor were you running the benchmarks on? 
> > 
> > On 23 March 2016 at 15:42, Erik Schnetter <[email protected]> wrote: 
> >> 
> >> I get a time ratio (bc / bb) of 1.1 
> >> 
> >> It could be that you're just having bad luck with the particular 
> >> optimization decisions that LLVM makes for the combined code, or with 
> >> the parameters (sizes) for this benchmark. Maybe the performance 
> >> difference changes for different matrix sizes? There are a million 
> >> things you can try, e.g. starting Julia with the "-O" option, or using 
> >> a different LLVM version. What would really help is to gather more 
> >> detailed information, e.g. by looking at the disassembled loop kernels 
> >> (to see whether something is wrong), or using a profiler to see where 
> >> the time is spent (Julia has a built-in profiler), or gathering 
> >> statistics about floating point instructions executed and cache 
> >> operations (that requires an external tool). 
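
(A minimal sketch of using the built-in profiler mentioned above; mgs! and the 1000 x 1000 input are only stand-ins for the actual code from the first post.)

A = rand(1000, 1000)
mgs!(copy(A))            # run once first so JIT compilation is not profiled
Profile.clear()
@profile mgs!(copy(A))
Profile.print()          # shows where the samples (i.e. the time) were spent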
> >> 
> >> The disassembled code is CPU-specific and also depends on the LLVM 
> >> version. I'd be happy to have a quick glance at it if you create a 
> >> listing (with `@code_native`) and e.g. put it up as a gist 
> >> <gist.github.com>. I'd also need your CPU type (`versioninfo()` in 
> >> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though. 
> >> 
> >> -erik 
> >> 
> >> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky 
> >> <[email protected]> wrote: 
> >> > I've attached two notebooks; you can check the comparisons. 
> >> > The first one compares the rank1update! and rank1updateb! functions. 
> >> > The Julia to BLAS equivalent comparison gives a ratio of 1.13, which 
> >> > is nice. The same applies to mygemv! vs BLAS.gemv. 
> >> > Combining the same routines into the mgs algorithm from the very 
> >> > first post, the resulting performance ratio mgs / mgs_blas is 2.6 on 
> >> > my computer (i7-6700HQ; that is important to mention, because on 
> >> > older processors the difference is not that big, it is similar to 
> >> > comparing the routines rank1update! and BLAS.ger!). This is what I'm 
> >> > trying to figure out. 
> >> > 
> >> > 
> >> > On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote: 
> >> >> 
> >> >> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky 
> >> >> <[email protected]> wrote: 
> >> >> > The factor of ~20% I mentioned just because it is something I've 
> >> >> > commonly observed; it can of course vary, and isn't that 
> >> >> > important. 
> >> >> > 
> >> >> > What bothers me is: why does the performance drop 2-fold when I 
> >> >> > combine two routines, where each one alone causes only a ~20% 
> >> >> > performance drop? 
> >> >> 
> >> >> I looked at the IJulia notebook you posted, but it wasn't obvious 
> >> >> which routines you mean. Can you point to exactly the source code 
> >> >> you are comparing? 
> >> >> 
> >> >> -erik 
> >> >> 
> >> >> > In other words, I have routines foo() and bar() and their 
> >> >> > equivalents in BLAS, fooblas() and barblas(), where 
> >> >> > @elapsed foo() / @elapsed fooblas() ~= 1.2 
> >> >> > and the same holds for bar(). Consider the following pseudo-code: 
> >> >> > 
> >> >> > function foobar() 
> >> >> >   for k in 1:N 
> >> >> >     foo()  # my Julia implementation of a BLAS function, e.g. gemv 
> >> >> >     bar()  # my Julia implementation of a BLAS function, e.g. ger 
> >> >> >   end 
> >> >> > end 
> >> >> > 
> >> >> > function foobarblas() 
> >> >> >   for k in 1:N 
> >> >> >     fooblas()  # the BLAS equivalent of foo, e.g. gemv 
> >> >> >     barblas()  # the BLAS equivalent of bar, e.g. ger 
> >> >> >   end 
> >> >> > end 
> >> >> > 
> >> >> > Then @elapsed foobar() / @elapsed foobarblas() ~= 2.6 
> >> >> > 
> >> >> > 
> >> >> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote: 
> >> >> >> 
> >> >> >> The architecture-specific, manual BLAS optimizations don't just 
> >> >> >> give you an additional 20%. They can improve speed by a factor of 
> >> >> >> a few. 
> >> >> >> 
> >> >> >> If you see a factor of 2.6, then that's probably to be accepted, 
> >> >> >> unless you really look into the details (generated assembler code, 
> >> >> >> measure cache misses, introduce manual vectorization and loop 
> >> >> >> unrolling, etc.). And you'll have to repeat that analysis if 
> >> >> >> you're using a different system. 
> >> >> >> 
> >> >> >> -erik 
> >> >> >> 
> >> >> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky 
> >> >> >> <[email protected]> wrote: 
> >> >> >> > Well, maybe the subject of the post is confusing. I've tried 
> >> >> >> > to write an algorithm which runs approximately as fast as 
> >> >> >> > using BLAS functions, but using a pure Julia implementation. 
> >> >> >> > Sure, we know that BLAS is highly optimized; I didn't want to 
> >> >> >> > beat BLAS, just to be a bit slower, let us say ~1.2 times. 
> >> >> >> > 
> >> >> >> > If I take a part of the algorithm and run it separately, all 
> >> >> >> > works fine. Consider the code below: 
> >> >> >> > function rank1update!(A, x, y) 
> >> >> >> >     for j = 1:size(A, 2) 
> >> >> >> >         @fastmath @inbounds @simd for i = 1:size(A, 1) 
> >> >> >> >             A[i,j] += 1.1 * y[j] * x[i] 
> >> >> >> >         end 
> >> >> >> >     end 
> >> >> >> > end 
> >> >> >> > 
> >> >> >> > function rank1updateb!(A, x, y) 
> >> >> >> >     R = BLAS.ger!(1.1, x, y, A) 
> >> >> >> > end 
> >> >> >> > 
> >> >> >> > Here BLAS is ~1.2-times faster. 
> >> >> >> > However, calling it together with 'mygemv!' in the loop (see 
> >> >> >> > the code in the original post), the performance drops to ~2.6 
> >> >> >> > times slower than using the BLAS functions (gemv, ger). 
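
(A rough sketch of the kind of timing comparison being discussed for the snippet above, assuming the 1000 x 1000 size mentioned elsewhere in the thread; the exact ratio will of course differ from machine to machine.)

A = rand(1000, 1000); x = rand(1000); y = rand(1000)
rank1update!(A, x, y); rank1updateb!(A, x, y)   # warm up: compile both first
t_julia = @elapsed rank1update!(A, x, y)        # pure-Julia rank-1 update
t_blas  = @elapsed rank1updateb!(A, x, y)       # BLAS.ger! wrapper
println(t_julia / t_blas)                       # ~1.2 in the numbers quoted above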
> >> >> >> > 
> >> >> >> > 
> >> >> >> > 
> >> >> >> > 
> >> >> >> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote: 
> >> >> >> >> 
> >> >> >> >> I'm not sure what the expected result here is. BLAS is 
> >> >> >> >> designed to be as fast as possible at matrix multiply. I'd be 
> >> >> >> >> more concerned if you write straightforward loop code and beat 
> >> >> >> >> BLAS, since that means the BLAS is badly mistuned. 
> >> >> >> >> 
> >> >> >> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky 
> >> >> >> >> <[email protected]> 
> >> >> >> >> wrote: 
> >> >> >> >>> 
> >> >> >> >>> Thanks, Steven. I thought there was something more behind it... 
> >> >> >> >>> 
> >> >> >> >>> I should note that I forgot to mention the matrix dimensions, 
> >> >> >> >>> which are 1000 x 1000. 
> >> >> >> >>> 
> >> >> >> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson wrote: 
> >> >> >> >>>> 
> >> >> >> >>>> You need a lot more than just fast loops to match the 
> >> >> >> >>>> performance of an optimized BLAS. See e.g. this notebook for 
> >> >> >> >>>> some comments on the related case of matrix multiplication: 
> >> >> >> >>>> 
> >> >> >> >>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb 
> >> >> >> >> 
> >> >> >> >> 
> >> >> >> > 
> >> >> >> 
> >> >> >> 
> >> >> >> 
> >> >> >> -- 
> >> >> >> Erik Schnetter <[email protected]> 
> >> >> >> http://www.perimeterinstitute.ca/personal/eschnetter/ 
> >> >> 
> >> >> 
> >> >> 
> >> >> -- 
> >> >> Erik Schnetter <[email protected]> 
> >> >> http://www.perimeterinstitute.ca/personal/eschnetter/ 
> >> 
> >> 
> >> 
> >> -- 
> >> Erik Schnetter <[email protected]> 
> >> http://www.perimeterinstitute.ca/personal/eschnetter/ 
> > 
> > 
>
>
>
> -- 
> Erik Schnetter <[email protected]> 
> http://www.perimeterinstitute.ca/personal/eschnetter/ 
>

Attachment: code_native_mgs.asm
Description: Binary data

Attachment: code_native_mgs_blas.asm
Description: Binary data
