I was using an "Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz".

-erik

On Wed, Mar 23, 2016 at 10:53 AM, Igor Cerovsky
<[email protected]> wrote:
> Thanks, Erik. I thought there was something deeper in LLVM.
> Since I'm quite new to Julia, I'll follow your suggestions and send you
> some outputs.
> What processor were you running the benchmarks on?
>
> On 23 March 2016 at 15:42, Erik Schnetter <[email protected]> wrote:
>>
>> I get a time ratio (bc / bb) of 1.1
>>
>> It could be that you're just having bad luck with the particular
>> optimization decisions that LLVM makes for the combined code, or with
>> the parameters (sizes) for this benchmark. Maybe the performance
>> difference changes for different matrix sizes? There's a million
>> things you can try, e.g. starting Julia with the "-O" option, or using
>> a different LLVM version. What would really help is gathering more
>> detailed information, e.g. by looking at the disassembled loop kernels
>> (to see whether something is wrong), or using a profiler to see where
>> the time is spent (Julia has a built-in profiler), or gathering
>> statistics about floating point instructions executed and cache
>> operations (that requires an external tool).
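>>
>> For instance, a minimal profiling sketch (mgs() here only stands in for
>> whatever routine you are measuring, and the 1000x1000 size is an arbitrary
>> choice):
>>
>> using Profile          # Julia >= 0.7; on 0.4/0.5 @profile is in Base
>>
>> A = rand(1000, 1000)
>> mgs(A)                 # warm up, so compilation is not what gets profiled
>> Profile.clear()
>> @profile for i in 1:10
>>     mgs(A)
>> end
>> Profile.print()        # shows in which functions/lines the time is spent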
>>
>> The disassembled code is CPU-specific and also depends on the LLVM
>> version. I'd be happy to have a quick glance at it if you create a
>> listing (with `@code_native`) and e.g. put it up as a gist
>> <gist.github.com>. I'd also need your CPU type (`versioninfo()` in
>> Julia, plus `cat /proc/cpuinfo` under Linux). No promises, though.
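>>
>> Concretely, something along these lines (rank1update! stands in for
>> whichever kernel you want to inspect; the argument types determine which
>> compiled specialization is shown):
>>
>> versioninfo()                         # Julia/LLVM version and CPU model
>> A = rand(1000, 1000); x = rand(1000); y = rand(1000)
>> @code_native rank1update!(A, x, y)    # native code for this method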
>>
>> -erik
>>
>> On Wed, Mar 23, 2016 at 4:04 AM, Igor Cerovsky
>> <[email protected]> wrote:
>> > I've attached two notebooks so you can check the comparisons. The first
>> > one compares the rank1update! and rank1updateb! functions. The Julia vs.
>> > BLAS comparison gives a ratio of 1.13, which is nice. The same applies to
>> > mygemv! vs. BLAS.gemv. Combining the same routines into the mgs algorithm
>> > from the very first post, the resulting ratio mgs / mgs_blas is 2.6 on my
>> > computer (i7-6700HQ; that is important to mention, because on older
>> > processors the difference is not that big, it is similar to comparing the
>> > routines rank1update! and BLAS.ger!). This is what I'm trying to figure
>> > out: why?
>> >
>> >
>> > On Tuesday, 22 March 2016 15:43:18 UTC+1, Erik Schnetter wrote:
>> >>
>> >> On Tue, Mar 22, 2016 at 4:36 AM, Igor Cerovsky
>> >> <[email protected]> wrote:
>> >> > I mentioned the factor of ~20% just because it is something I've
>> >> > commonly observed; of course it can vary, and it isn't that important.
>> >> >
>> >> > What bothers me is: why does the performance drop by a factor of ~2
>> >> > when I combine the two routines, when each one alone is only ~1.2-times
>> >> > slower than its BLAS equivalent?
>> >>
>> >> I looked at the IJulia notebook you posted, but it wasn't obvious which
>> >> routines you meant. Can you point to exactly the source code you are
>> >> comparing?
>> >>
>> >> -erik
>> >>
>> >> > In other words, I have routines foo() and bar() and their BLAS
>> >> > equivalents fooblas() and barblas(), where
>> >> > @elapsed foo() / @elapsed fooblas() ~= 1.2
>> >> > and the same for bar(). Consider the following pseudo-code:
>> >> >
>> >> > function foobar()
>> >> >   for k in 1:N
>> >> >     foo()  # my Julia implementation of a BLAS function, for example gemv
>> >> >     bar()  # my Julia implementation of a BLAS function, for example ger
>> >> >   end
>> >> > end
>> >> >
>> >> > function foobarblas()
>> >> >   for k in 1:N
>> >> >     fooblas()  # the BLAS equivalent of foo, for example gemv
>> >> >     barblas()  # the BLAS equivalent of bar, for example ger
>> >> >   end
>> >> > end
>> >> >
>> >> > Then @elapsed foobar() / @elapsed foobarblas() ~= 2.6.
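>> >> >
>> >> > (For concreteness, a runnable version of the same comparison; the kernel
>> >> > bodies below are illustrative stand-ins, and the 1000x1000 size and N
>> >> > are arbitrary choices:)
>> >> >
>> >> > using LinearAlgebra        # Julia >= 0.7; on 0.4/0.5 BLAS is in Base
>> >> >
>> >> > function foo!(y, A, x)     # pure-Julia gemv-like kernel: y = A*x
>> >> >     fill!(y, 0.0)
>> >> >     for j = 1:size(A, 2)
>> >> >         @inbounds @simd for i = 1:size(A, 1)
>> >> >             y[i] += A[i, j] * x[j]
>> >> >         end
>> >> >     end
>> >> > end
>> >> >
>> >> > function bar!(A, x, y)     # pure-Julia ger-like kernel: A += 1.1*x*y'
>> >> >     for j = 1:size(A, 2)
>> >> >         @inbounds @simd for i = 1:size(A, 1)
>> >> >             A[i, j] += 1.1 * y[j] * x[i]
>> >> >         end
>> >> >     end
>> >> > end
>> >> >
>> >> > function foobar!(A, x, y, N)
>> >> >     for k in 1:N
>> >> >         foo!(y, A, x)
>> >> >         bar!(A, x, y)
>> >> >     end
>> >> > end
>> >> >
>> >> > function foobarblas!(A, x, y, N)
>> >> >     for k in 1:N
>> >> >         BLAS.gemv!('N', 1.0, A, x, 0.0, y)   # y = A*x
>> >> >         BLAS.ger!(1.1, x, y, A)              # A += 1.1*x*y'
>> >> >     end
>> >> > end
>> >> >
>> >> > A = rand(1000, 1000); x = rand(1000); y = rand(1000); N = 10
>> >> > foobar!(copy(A), x, copy(y), N); foobarblas!(copy(A), x, copy(y), N)  # warm up
>> >> > @elapsed(foobar!(copy(A), x, copy(y), N)) /
>> >> >     @elapsed(foobarblas!(copy(A), x, copy(y), N))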
>> >> >
>> >> >
>> >> > On Monday, 21 March 2016 15:35:58 UTC+1, Erik Schnetter wrote:
>> >> >>
>> >> >> The architecture-specific, manual BLAS optimizations don't just give
>> >> >> you an additional 20%. They can improve speed by a factor of a few.
>> >> >>
>> >> >> If you see a factor of 2.6, then that's probably to be accepted, unless
>> >> >> you really look into the details (generated assembler code, measuring
>> >> >> cache misses, introducing manual vectorization and loop unrolling,
>> >> >> etc.). And you'll have to repeat that analysis if you're using a
>> >> >> different system.
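>> >> >>
>> >> >> For example, a hand-unrolled variant of the rank-1 update kernel from
>> >> >> this thread (the unroll factor 4 is an arbitrary choice, and whether it
>> >> >> helps at all depends on the CPU and LLVM version; this is only a sketch
>> >> >> of the kind of experiment I mean):
>> >> >>
>> >> >> function rank1update_unrolled!(A, x, y)
>> >> >>     m = size(A, 1)
>> >> >>     for j = 1:size(A, 2)
>> >> >>         c = 1.1 * y[j]              # hoist the loop-invariant factor
>> >> >>         i = 1
>> >> >>         @inbounds while i + 3 <= m  # process 4 rows per iteration
>> >> >>             A[i,   j] += c * x[i]
>> >> >>             A[i+1, j] += c * x[i+1]
>> >> >>             A[i+2, j] += c * x[i+2]
>> >> >>             A[i+3, j] += c * x[i+3]
>> >> >>             i += 4
>> >> >>         end
>> >> >>         @inbounds while i <= m      # remainder rows
>> >> >>             A[i, j] += c * x[i]
>> >> >>             i += 1
>> >> >>         end
>> >> >>     end
>> >> >>     return A
>> >> >> end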
>> >> >>
>> >> >> -erik
>> >> >>
>> >> >> On Mon, Mar 21, 2016 at 10:18 AM, Igor Cerovsky
>> >> >> <[email protected]> wrote:
>> >> >> > Well, maybe the subject of the post is confusing. I've tried to write
>> >> >> > an algorithm that runs approximately as fast as one using BLAS
>> >> >> > functions, but in a pure Julia implementation. Sure, we know that
>> >> >> > BLAS is highly optimized; I didn't want to beat BLAS, just to be a
>> >> >> > bit slower, let us say ~1.2-times.
>> >> >> >
>> >> >> > If I take a part of the algorithm and run it separately, all works
>> >> >> > fine. Consider the code below:
>> >> >> > function rank1update!(A, x, y)
>> >> >> >     for j = 1:size(A, 2)
>> >> >> >         @fastmath @inbounds @simd for i = 1:size(A, 1)
>> >> >> >             A[i,j] += 1.1 * y[j] * x[i]
>> >> >> >         end
>> >> >> >     end
>> >> >> > end
>> >> >> >
>> >> >> > function rank1updateb!(A, x, y)
>> >> >> >     R = BLAS.ger!(1.1, x, y, A)
>> >> >> > end
>> >> >> >
>> >> >> > Here BLAS is ~1.2-times faster.
>> >> >> > However, calling it together with 'mygemv!' in the loop (see the code
>> >> >> > in the original post), the performance drops to ~2.6 times slower
>> >> >> > than using the BLAS functions (gemv, ger).
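>> >> >> >
>> >> >> > (The 1.2x figure comes from a simple timing along these lines, using
>> >> >> > the two functions above; the 1000x1000 size and the warm-up calls are
>> >> >> > the important details, and on Julia >= 0.7 `using LinearAlgebra` is
>> >> >> > needed for BLAS:)
>> >> >> >
>> >> >> > A = rand(1000, 1000); x = rand(1000); y = rand(1000)
>> >> >> > rank1update!(copy(A), x, y); rank1updateb!(copy(A), x, y)  # warm up (compile first)
>> >> >> > B1 = copy(A); B2 = copy(A)
>> >> >> > t_jl   = @elapsed rank1update!(B1, x, y)
>> >> >> > t_blas = @elapsed rank1updateb!(B2, x, y)
>> >> >> > t_jl / t_blas            # ~1.2 here, per the numbers above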
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Monday, 21 March 2016 13:34:27 UTC+1, Stefan Karpinski wrote:
>> >> >> >>
>> >> >> >> I'm not sure what the expected result here is. BLAS is designed to
>> >> >> >> be as fast as possible at matrix multiply. I'd be more concerned if
>> >> >> >> you wrote straightforward loop code and beat BLAS, since that would
>> >> >> >> mean the BLAS is badly mistuned.
>> >> >> >>
>> >> >> >> On Mon, Mar 21, 2016 at 5:58 AM, Igor Cerovsky
>> >> >> >> <[email protected]>
>> >> >> >> wrote:
>> >> >> >>>
>> >> >> >>> Thanks Steven, I thought there was something more behind it...
>> >> >> >>>
>> >> >> >>> I should note that I forgot to mention the matrix dimensions, which
>> >> >> >>> are 1000 x 1000.
>> >> >> >>>
>> >> >> >>> On Monday, 21 March 2016 10:48:33 UTC+1, Steven G. Johnson
>> >> >> >>> wrote:
>> >> >> >>>>
>> >> >> >>>> You need a lot more than just fast loops to match the performance
>> >> >> >>>> of an optimized BLAS. See e.g. this notebook for some comments on
>> >> >> >>>> the related case of matrix multiplication:
>> >> >> >>>>
>> >> >> >>>> http://nbviewer.jupyter.org/url/math.mit.edu/~stevenj/18.335/Matrix-multiplication-experiments.ipynb
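>> >> >> >>>>
>> >> >> >>>> (One of the ingredients discussed there is cache blocking; a
>> >> >> >>>> minimal sketch, with the block size 64 chosen arbitrarily, and
>> >> >> >>>> still far from what an optimized BLAS does:)
>> >> >> >>>>
>> >> >> >>>> function blocked_matmul!(C, A, B; bs = 64)
>> >> >> >>>>     fill!(C, 0.0)
>> >> >> >>>>     n, m, p = size(A, 1), size(A, 2), size(B, 2)
>> >> >> >>>>     for jj = 1:bs:p, kk = 1:bs:m, ii = 1:bs:n   # blocks sized to stay in cache
>> >> >> >>>>         for j = jj:min(jj+bs-1, p), k = kk:min(kk+bs-1, m)
>> >> >> >>>>             b = B[k, j]
>> >> >> >>>>             @inbounds @simd for i = ii:min(ii+bs-1, n)
>> >> >> >>>>                 C[i, j] += A[i, k] * b
>> >> >> >>>>             end
>> >> >> >>>>         end
>> >> >> >>>>     end
>> >> >> >>>>     return C
>> >> >> >>>> end
>> >> >> >>>>
>> >> >> >>>> blocked_matmul!(zeros(1000, 1000), rand(1000, 1000), rand(1000, 1000))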
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Erik Schnetter <[email protected]>
>> >> >> http://www.perimeterinstitute.ca/personal/eschnetter/
>> >>
>> >>
>> >>
>> >> --
>> >> Erik Schnetter <[email protected]>
>> >> http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>>
>>
>> --
>> Erik Schnetter <[email protected]>
>> http://www.perimeterinstitute.ca/personal/eschnetter/
>
>



-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
