And one last update: I disabled Hyperthreading on the i7, and now it performs
as expected.
i7 no HT:
1000000 loops, best of 3: 879.57 ns per loop
100000 loops, best of 3: 9.88 µs per loop
100 loops, best of 3: 4.46 ms per loop
10000 loops, best of 3: 69.89 µs per loop
10000 loops, best of 3: 26.67 µs per loop
10 loops, best of 3: 95.08 ms per loop
i7 with HT:
1000000 loops, best of 3: 871.68 ns per loop
10000 loops, best of 3: 10.84 µs per loop
100 loops, best of 3: 5.19 ms per loop
10000 loops, best of 3: 71.35 µs per loop
10000 loops, best of 3: 26.65 µs per loop
1 loops, best of 3: 159.99 ms per loop
So all the calls inside the loop run at the same speed, but the whole loop, with
identical assembly code, is ~60% slower when HT is enabled. Where can this
problem then come from? LLVM? Or thread pinning in the OS? Probably not a
Julia problem then...
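One more thing I can still try to narrow it down (a sketch, assuming the same r, x and s(r, x) as in the code quoted below): the r * x[i,:] inside the loop should go through OpenBLAS (gemv), so forcing a single BLAS thread before re-timing would show whether the HT slowdown comes from OpenBLAS thread placement rather than from the generated code:

using TimeIt

# Pin OpenBLAS to a single thread, then re-time the whole loop with HT enabled
Base.LinAlg.BLAS.set_num_threads(1)
@timeit s(r, x)

If the ~60% gap disappears, it is a BLAS threading/affinity effect; if it stays, the generated code or the OS scheduler is the more likely culprit.
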
On Tuesday, April 5, 2016 at 7:54:16 PM UTC+2, Milan Bouchet-Valat wrote:
>
> On Tuesday, April 5, 2016 at 10:18 -0700, Johannes Wagner wrote:
> > hey Milan,
> > so consider following code:
> >
> > Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git")
> > using TimeIt
> >
> > v = rand(3)
> > r = rand(6000,3)
> > x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))'
> >
> > dotprods = r * x[2,:]
> > imexp = cis(dotprods)
> > sumprod = sum(imexp) * sum(conj(imexp))
> >
> > f(r, x) = r * x[2,:]
> > g(r, x) = r * x'
> > h(imexp) = sum(imexp) * sum(conj(imexp))
> >
> > function s(r, x)
> >     result = zeros(size(x,1))
> >     for i = 1:size(x,1)
> >         imexp = cis(r * x[i,:])
> >         result[i] = sum(imexp) * sum(conj(imexp))
> >     end
> >     return result
> > end
> >
> > @timeit zeros(size(x,1))
> > @timeit f(r,x)
> > @timeit g(r,x)
> > @timeit cis(dotprods)
> > @timeit h(imexp)
> > @timeit s(r,x)
> >
> > @code_native f(r,x)
> > @code_native g(r,x)
> > @code_native cis(dotprods)
> > @code_native h(imexp)
> > @code_native s(r,x)
> >
> > and I attached the output of the last @code_native s(r,x) as text
> > files, for the binary tarball as well as for the latest nalimilan update.
> > For the whole function s, the exported code actually looks the same
> > everywhere.
> > But s(r,x) is the one that is considerably slower on the i7 than on the
> > i5, whereas all the other timed calls run at more or less the same speed
> > on the i5 and the i7. Here are the timings in the same order as above
> > (all run repeatedly so that compile time is not included in the last one):
> >
> > i7:
> > 1000000 loops, best of 3: 871.68 ns per loop
> > 10000 loops, best of 3: 10.84 µs per loop
> > 100 loops, best of 3: 5.19 ms per loop
> > 10000 loops, best of 3: 71.35 µs per loop
> > 10000 loops, best of 3: 26.65 µs per loop
> > 1 loops, best of 3: 159.99 ms per loop
> >
> > i5:
> > 100000 loops, best of 3: 1.01 µs per loop
> > 10000 loops, best of 3: 10.93 µs per loop
> > 100 loops, best of 3: 5.09 ms per loop
> > 10000 loops, best of 3: 75.93 µs per loop
> > 10000 loops, best of 3: 29.23 µs per loop
> > 1 loops, best of 3: 103.70 ms per loop
> >
> > So based on the calls inside s(r,x), the i7 should be faster, but the
> > whole s(r,x) is slower. Still clueless... And I don't know how to pin
> > this down further...
> Thanks. I think you got mixed up with the different files, as the
> versioninfo() output indicates. Anyway, there's enough info to check
> which file corresponds to which Julia version, so that's OK. Indeed,
> when comparing the tests with binary tarballs, there's a call
> to jl_alloc_array_1d with the i7 (julia050_tarball-haswell-i7.txt),
> which is not present with the i5 (incorrectly named
> julia050_haswell-i7.txt). This is really unexpected.
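>
> One way to double-check this on both machines (just a sketch, assuming the
> same r, x and s as in your code) is to dump the generated IR to a file and
> grep it for jl_alloc_array_1d; code_native can be used the same way to
> produce the files you attached:
>
> # Write the LLVM IR for s(r, x) to a file (name is arbitrary) so the two machines can be diffed
> open("s_llvm.txt", "w") do io
>     code_llvm(io, s, (typeof(r), typeof(x)))
> end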
>
> Could you file an issue on GitHub with a summary of what we've found
> (essentially your message), as well as links to 3 Gists giving the code
> and the contents of the two .txt files I mentioned above? That would be
> very helpful. Do not mention the Fedora packages at all, as the binary
> tarballs are closer to what Julia developers use.
>
>
> Regards
>
>
> > cheers, Johannes
> >
> >
> >
> >
> > > On Monday, April 4, 2016 at 10:36 -0700, Johannes Wagner wrote:
> > > > hey guys,
> > > > so attached you'll find text files with the @code_native output for the
> > > > instructions
> > > > - r * x[1,:]
> > > > - cis(imexp)
> > > > - sum(imexp) * sum(conj(imexp))
> > > >
> > > > for julia 0.5.
> > > >
> > > > The hardware I run on is a Haswell i5 machine, a Haswell i7 machine,
> > > > and an IvyBridge i5 machine. It turned out that on the Haswell i5
> > > > machine the code also runs fast; only the Haswell i7 machine is the
> > > > slow one. This really drove me nuts. First I thought it was the OS,
> > > > then the architecture, and now it's just i5 vs. i7... Anyway, I don't
> > > > know anything about x86 assembly, but the Julia 0.4.5 code is the same
> > > > on all machines. However, for the dot product, the 0.5 code already
> > > > has 2 different instructions on the i5 vs. the i7 (lines 44 & 47). The
> > > > same goes for the cis call (line 149...). And the IvyBridge i5 code is
> > > > similar to the Haswell i5. I also included versioninfo() at the top
> > > > of the file, so you could just look at a vimdiff of the julia0.5
> > > > files... Can anyone make sense of this?
> > > I'm definitely not an expert in assembly, but that additional leaq
> > > instruction on line 44, and the additional movq instructions on lines
> > > 111, 151 and 152, really look weird.
> > >
> > > Could you do the same test with the binary tarballs? If the difference
> > > persists, you should open an issue on GitHub to track this.
> > >
> > > BTW, please wrap the first call in a function to ensure it is
> > > specialized for the argument types, i.e.:
> > >
> > > f(r, x) = r * x[1,:]
> > > @code_native f(r, x)
> > >
> > > Also, please check whether you still see the difference with this
> > > code:
> > > g(r, x) = r * x
> > > @code_native g(r, x[1,:])
> > >
> > > What are the types of r and x? Could you provide a simple
> > > reproducible example with dummy values?
> > >
> > > > I will still test the binary tarballs. If I remove the cis() call,
> > > > the difference is hard to tell: the loop is ~10 times faster and more
> > > > or less around 5 ms everywhere. For the whole loop with the cis()
> > > > call, it goes from ~50 ms on the i5 to ~90 ms on the i7.
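> > > >
> > > > (For reference, this is roughly the variant I time for the "no cis()"
> > > > case; just a sketch, the real code may differ slightly:)
> > > >
> > > > function s_nocis(r, x)
> > > >     result = zeros(size(x,1))
> > > >     for i = 1:size(x,1)
> > > >         d = r * x[i,:]            # same matrix-vector product, no cis()
> > > >         result[i] = sum(d) * sum(d)
> > > >     end
> > > >     return result
> > > > end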
> > > >
> > > > Shall I also post the julia 0.4 code?
> > > If it's identical for all machines, I don't think it's needed.
> > >
> > >
> > > Regards
> > >
> > >
> > > > cheers, Johannes
> > > >
> > > >
> > > >
> > > > > On Wednesday, March 30, 2016 at 15:16 -0700, Johannes Wagner wrote:
> > > > > >
> > > > > >
> > > > > > > On Wednesday, March 30, 2016 at 04:43 -0700, Johannes Wagner wrote:
> > > > > > > > Sorry for not having expressed myself clearly, I meant that the
> > > > > > > > latest version of Fedora (24 development) works fine. I always
> > > > > > > > used the latest Julia nightly available from the nalimilan copr
> > > > > > > > repo. Right now that is 0.5.0-dev+3292, Commit 9d527c5*; all use
> > > > > > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell).
> > > > > > > >
> > > > > > > > peakflops() on all machines (identical hardware) is ~1.2e11 to 1.5e11.
> > > > > > > >
> > > > > > > > Fedora 22 & 23 with Julia 0.5 are ~50% slower than 0.4; only on
> > > > > > > > Fedora 24 is Julia 0.5 faster than Julia 0.4.
> > > > > > > Could you try to find a simple piece of code that reproduces the
> > > > > > > problem? In particular, it would be useful to check whether this
> > > > > > > comes from OpenBLAS differences or whether it also happens with
> > > > > > > pure Julia code (typical operations which depend on BLAS are
> > > > > > > matrix multiplication, as well as most of linear algebra).
> > > > > > > Normally, 0.4 and 0.5 should use the same BLAS, but who knows...
> > > > > > Well, that's what I did, and the 3 simple calls inside the loop are
> > > > > > more or less the same speed; only the whole loop seems slower. See
> > > > > > my code sample from the answer of March 8th (the code gets faster in
> > > > > > the same proportions when exp(im .* dotprods) is replaced by
> > > > > > cis(dotprods)).
> > > > > > So I don't know what I can do then...
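> > > > > >
> > > > > > For reference, the two forms give numerically identical results; a
> > > > > > quick check (just a sketch, using the dotprods array from my earlier
> > > > > > code):
> > > > > >
> > > > > > # cis(x) computes cos(x) + im*sin(x) directly, avoiding the intermediate im .* x array
> > > > > > @assert isapprox(cis(dotprods), exp(im .* dotprods))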
> > > > > Sorry, somehow I had missed that message. This indeed looks like a
> > > > > code generation issue in Julia/LLVM.
> > > > >
> > > > > > > Can you also confirm that all versioninfo() fields are the same
> > > > > > > for all three machines, both for 0.4 and 0.5? We must consider the
> > > > > > > possibility that the differences actually come from 0.4.
> > > > > > Oh, right! I just noticed that my Fedora 24 machine was an Ivy
> > > > > > Bridge, which works fast:
> > > > > >
> > > > > > Julia Version 0.5.0-dev+3292
> > > > > > Commit 9d527c5* (2016-03-28 06:55 UTC)
> > > > > > Platform Info:
> > > > > > System: Linux (x86_64-redhat-linux)
> > > > > > CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
> > > > > > WORD_SIZE: 64
> > > > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > > > > > LAPACK: libopenblasp.so.0
> > > > > > LIBM: libopenlibm
> > > > > > LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)
> > > > > >
> > > > > > and the other ones with Fedora 22/23 are Haswell, which are slow:
> > > > > >
> > > > > > Julia Version 0.5.0-dev+3292
> > > > > > Commit 9d527c5* (2016-03-28 06:55 UTC)
> > > > > > Platform Info:
> > > > > > System: Linux (x86_64-redhat-linux)
> > > > > > CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> > > > > > WORD_SIZE: 64
> > > > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
> > > > > > LAPACK: libopenblasp.so.0
> > > > > > LIBM: libopenlibm
> > > > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> > > > > >
> > > > > > I just booted a Fedora 23 on the Ivy Bridge machine and it's also
> > > > > > fast.
> > > > > >
> > > > > > Now if I use Julia 0.4.5 on both architectures:
> > > > > >
> > > > > > Julia Version 0.4.5
> > > > > > Commit 2ac304d* (2016-03-18 00:58 UTC)
> > > > > > Platform Info:
> > > > > > System: Linux (x86_64-redhat-linux)
> > > > > > CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> > > > > > WORD_SIZE: 64
> > > > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
> > > > > > LAPACK: libopenblasp.so.0
> > > > > > LIBM: libopenlibm
> > > > > > LLVM: libLLVM-3.3
> > > > > >
> > > > > > and:
> > > > > >
> > > > > > Julia Version 0.4.5
> > > > > > Commit 2ac304d* (2016-03-18 00:58 UTC)
> > > > > > Platform Info:
> > > > > > System: Linux (x86_64-redhat-linux)
> > > > > > CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
> > > > > > WORD_SIZE: 64
> > > > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > > > > > LAPACK: libopenblasp.so.0
> > > > > > LIBM: libopenlibm
> > > > > > LLVM: libLLVM-3.3
> > > > > >
> > > > > > there is no speed difference, apart from the ~10% or so from the
> > > > > > faster Haswell machine. So could it perhaps be specific to the
> > > > > > Haswell hardware target, with the change from LLVM 3.3 to 3.7.1?
> > > > > > Is there anything else I could provide?
> > > > > This is certainly an interesting finding. Could you paste somewhere
> > > > > the output of @code_native for your function on Sandybridge vs.
> > > > > Haswell, for both 0.4 and 0.5?
> > > > >
> > > > > It would also be useful to check whether the same difference appears
> > > > > if you use the generic binary tarballs from
> > > > > http://julialang.org/downloads.
> > > > >
> > > > > Finally, do you get the same result if you remove the call to exp()
> > > > > from the loop? (This is the only external function, so it shouldn't
> > > > > be affected by changes in Julia.)
> > > > >
> > > > >
> > > > > Regards
> > > > >
> > > > >
> > > > > > Best, Johannes
> > > > > >
> > > > > > > Regards
> > > > > >
> > > > > >
> > > > > > > > On Wednesday, March 16, 2016 at 09:25 -0700, Johannes Wagner wrote:
> > > > > > > > > Just a little update. I tested some other Fedoras: Fedora 22
> > > > > > > > > with LLVM 3.8 is also slow with Julia 0.5, whereas a Fedora 24
> > > > > > > > > branch with LLVM 3.7 is faster on Julia 0.5 compared to Julia
> > > > > > > > > 0.4, as it should be (the speedup in the inner loop parts
> > > > > > > > > translates into a speedup of the whole function).
> > > > > > > > >
> > > > > > > > > I don't know if anyone cares about that... At least the latest
> > > > > > > > > version seems to work fine; I hope it stays like this in the
> > > > > > > > > final Fedora 24.
> > > > > > > > What's the "latest version"? git built from source or RPM
> > > > > > > > nightlies? With which LLVM version for each?
> > > > > > > >
> > > > > > > > If from the RPMs, I've switched them to LLVM 3.8 for a few days,
> > > > > > > > and went back to 3.7 because of a build failure. So that might
> > > > > > > > explain the difference. You can install the last version which
> > > > > > > > built with LLVM 3.8 manually from here:
> > > > > > > > https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
> > > > > > > >
> > > > > > > > It would be interesting to compare it with the latest nightly
> > > > > > > > with 3.7.
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > > hey guys,
> > > > > > > > > > I just experienced something weird. I have some code that
> > > > > > > > > > runs fine on 0.4.3, then I updated to 0.5-dev to test the new
> > > > > > > > > > Arrays, ran the same code, and noticed it got about ~50%
> > > > > > > > > > slower. Then I downgraded back to 0.4.3 and ran the old code,
> > > > > > > > > > but the speed remained slow. I noticed that while reinstalling
> > > > > > > > > > 0.4.3, openblas-threads didn't get installed along with it.
> > > > > > > > > > So I manually installed it, but no change.
> > > > > > > > > > Does anyone have an idea what could be going on? LLVM on
> > > > > > > > > > Fedora 23 is 3.7.
> > > > > > > > > >
> > > > > > > > > > Cheers, Johannes
> > > > > > > > > >
>