> On 6 Apr  2016, at 4:46 PM, Milan Bouchet-Valat <[email protected]> wrote:
> 
> On Wednesday 6 April 2016 at 07:25 -0700, Johannes Wagner wrote:
>> And one last update: I disabled hyperthreading on the i7, and now it
>> performs as expected.
>> 
>> i7 no HT:
>> 
>> 1000000 loops, best of 3: 879.57 ns per loop
>> 100000 loops, best of 3: 9.88 µs per loop
>> 100 loops, best of 3: 4.46 ms per loop
>> 10000 loops, best of 3: 69.89 µs per loop
>> 10000 loops, best of 3: 26.67 µs per loop
>> 10 loops, best of 3: 95.08 ms per loop
>> 
>> i7 with HT:
>> 
>> 1000000 loops, best of 3: 871.68 ns per loop
>> 10000 loops, best of 3: 10.84 µs per loop
>> 100 loops, best of 3: 5.19 ms per loop
>> 10000 loops, best of 3: 71.35 µs per loop
>> 10000 loops, best of 3: 26.65 µs per loop
>> 1 loops, best of 3: 159.99 ms per loop
>> 
>> So all calls inside the loop are the same speed, but the whole loop,
>> with identical assembly code, is ~60% slower if HT is enabled. Where
>> could this problem come from then? LLVM? Or thread pinning in the OS?
>> Probably not a Julia problem then...
> Indeed, in the last assembly output you sent, there are no differences
> between the i5 and the i7 (as expected). So this is neither Julia's nor
> LLVM's fault. No idea whether there might be an issue with the CPU
> itself, but it's quite surprising.

I ran it on a second i7 machine. Same behavior, so it's definitely not a 
faulty CPU. Do you have any other idea what to do? Just leaving it as is and 
using Julia with hyperthreading disabled is not really satisfactory...
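
One more thing I could still try (just a sketch of the idea, not something 
I have verified: it assumes the gap comes from OpenBLAS worker threads 
landing on hyperthread siblings, and it reuses s, r and x from the code 
sample quoted below):

using TimeIt
Base.LinAlg.BLAS.set_num_threads(1)   # force OpenBLAS to a single thread
@timeit s(r, x)                       # if the ~60% gap vanishes with HT on,
                                      # it points at BLAS threading, not codegen

At the OS level one could also pin the process to physical cores only, e.g. 
`taskset -c 0-3 julia script.jl` on Linux, to see whether the slowdown 
follows the scheduler.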


> Regards
> 
> 
>>> On Tuesday 5 April 2016 at 10:18 -0700, Johannes Wagner wrote:
>>>> hey Milan, 
>>>> so consider the following code: 
>>>>  
>>>> Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git") 
>>>> using TimeIt 
>>>>  
>>>> v = rand(3) 
>>>> r = rand(6000,3) 
>>>> x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))' 
>>>>  
>>>> dotprods = r * x[2,:] 
>>>> imexp    = cis(dotprods) 
>>>> sumprod  = sum(imexp) * sum(conj(imexp)) 
>>>>  
>>>> f(r, x) = r * x[2,:]     
>>>> g(r, x) = r * x' 
>>>> h(imexp)    = sum(imexp) * sum(conj(imexp)) 
>>>>  
>>>> function s(r, x) 
>>>>         result = zeros(size(x,1)) 
>>>>         for i = 1:size(x,1) 
>>>>                 imexp    = cis(r * x[i,:]) 
>>>>                 result[i]= sum(imexp) * sum(conj(imexp)) 
>>>>         end 
>>>>         return result 
>>>> end 
>>>>  
>>>> @timeit zeros(size(x,1)) 
>>>> @timeit f(r,x) 
>>>> @timeit g(r,x) 
>>>> @timeit cis(dotprods) 
>>>> @timeit h(imexp) 
>>>> @timeit s(r,x) 
>>>>  
>>>> @code_native f(r,x) 
>>>> @code_native g(r,x) 
>>>> @code_native cis(dotprods) 
>>>> @code_native h(imexp) 
>>>> @code_native s(r,x) 
>>>>  
>>>> and I attached the output of the last @code_native s(r,x) as text
>>>> files, for the binary tarball as well as the latest nalimilan update.
>>>> For the whole function s, the exported code actually looks the same
>>>> everywhere. But s(r,x) is the one that is considerably slower on the
>>>> i7 than on the i5, whereas all the other timed calls are more or less
>>>> the same speed on i5 and i7. Here are the timings in the same order as
>>>> above (all run repeatedly, so the last one does not include compile
>>>> time):
>>>>  
>>>> i7: 
>>>> 1000000 loops, best of 3: 871.68 ns per loop 
>>>> 10000 loops, best of 3: 10.84 µs per loop 
>>>> 100 loops, best of 3: 5.19 ms per loop 
>>>> 10000 loops, best of 3: 71.35 µs per loop 
>>>> 10000 loops, best of 3: 26.65 µs per loop 
>>>> 1 loops, best of 3: 159.99 ms per loop 
>>>>  
>>>> i5: 
>>>> 100000 loops, best of 3: 1.01 µs per loop 
>>>> 10000 loops, best of 3: 10.93 µs per loop 
>>>> 100 loops, best of 3: 5.09 ms per loop 
>>>> 10000 loops, best of 3: 75.93 µs per loop 
>>>> 10000 loops, best of 3: 29.23 µs per loop 
>>>> 1 loops, best of 3: 103.70 ms per loop 
>>>>  
>>>> So based on the calls inside s(r,x), the i7 should be faster, but the
>>>> whole s(r,x) is slower. Still clueless... and I don't know how to pin
>>>> this down further...
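>>>> One more thing I might try, just as a sketch and untested: a variant
>>>> of s() that reuses buffers to avoid the per-iteration allocations from
>>>> x[i,:] and r * ..., using A_mul_B! from Base. If the i7/i5 gap shrinks,
>>>> the difference would be in allocation/GC behaviour rather than in the
>>>> arithmetic itself:
>>>> 
>>>> function s_noalloc(r, x)
>>>>         n      = size(x,1)
>>>>         result = zeros(n)
>>>>         xrow   = zeros(size(x,2))     # reused buffer for one row of x
>>>>         dotp   = zeros(size(r,1))     # reused buffer for r * xrow
>>>>         for i = 1:n
>>>>                 for j = 1:size(x,2)
>>>>                         xrow[j] = x[i,j]
>>>>                 end
>>>>                 A_mul_B!(dotp, r, xrow)   # dotp = r * xrow, in place
>>>>                 imexp     = cis(dotp)     # still allocates; no in-place cis
>>>>                 result[i] = sum(imexp) * sum(conj(imexp))
>>>>         end
>>>>         return result
>>>> end
>>>> @timeit s_noalloc(r,x)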
>>> Thanks. I think you got mixed up with the different files, as the
>>> versioninfo() output indicates. Anyway, there's enough info to check
>>> which file corresponds to which Julia version, so that's OK. Indeed,
>>> when comparing the tests with binary tarballs, there's a call to
>>> jl_alloc_array_1d with the i7 (julia050_tarball-haswell-i7.txt) which
>>> is not present with the i5 (incorrectly named julia050_haswell-i7.txt).
>>> This is really unexpected.
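>>> A quick way to check for that extra allocation without reading assembly
>>> might be something like this (just a sketch):
>>> 
>>> @code_llvm s(r, x)   # the jl_alloc_array_1d call is easier to spot here
>>> 
>>> s(r, x)              # run once so compilation is not measured
>>> @time s(r, x)        # then compare the "bytes allocated" on i5 vs. i7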
>>> 
>>> Could you file an issue on GitHub with a summary of what we've found
>>> (essentially your message), as well as links to 3 Gists giving the code
>>> and the contents of the two .txt files I mentioned above? That would be
>>> very helpful. Do not mention the Fedora packages at all, as the binary
>>> tarballs are closer to what Julia developers use.
>>> 
>>> 
>>> Regards 
>>> 
>>> 
>>>> cheers, Johannes 
>>>>  
>>>>  
>>>>  
>>>>  
>>>>> On Monday 4 April 2016 at 10:36 -0700, Johannes Wagner wrote:
>>>>>> hey guys,
>>>>>> so attached you find text files with the @code_native output for the
>>>>>> instructions
>>>>>> - r * x[1,:]
>>>>>> - cis(imexp)
>>>>>> - sum(imexp) * sum(conj(imexp))
>>>>>> 
>>>>>> for Julia 0.5.
>>>>>> 
>>>>>> The hardware I run on is a Haswell i5 machine, a Haswell i7 machine,
>>>>>> and an IvyBridge i5 machine. It turned out that the code also runs
>>>>>> fast on a Haswell i5 machine; only the Haswell i7 machine is the
>>>>>> slow one. This really drove me nuts. First I thought it was the OS,
>>>>>> then the architecture, and now it's just i5 vs. i7... Anyway, I
>>>>>> don't know anything about x86 assembly, but the Julia 0.4.5 code is
>>>>>> the same on all machines. However, for the dot product, the 0.5 code
>>>>>> already has 2 different instructions on the i5 vs. the i7 (lines
>>>>>> 44 & 47), and the same for the cis call (line 149...). And the
>>>>>> IvyBridge i5 code is similar to the Haswell i5. I also included
>>>>>> versioninfo() at the top of the file. So you could just look at a
>>>>>> vimdiff of the julia0.5 files... Can anyone make sense out of this?
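>>>>>> For reference, this is roughly how the files were produced (a sketch
>>>>>> with a hypothetical file name; it assumes the io-taking methods of
>>>>>> versioninfo and code_native are available on this Julia version):
>>>>>> 
>>>>>> imexp = cis(rand(6000))
>>>>>> h(imexp) = sum(imexp) * sum(conj(imexp))
>>>>>> open("julia05_haswell_i7.txt", "w") do io
>>>>>>     versioninfo(io)                       # recorded at the top of the file
>>>>>>     code_native(io, h, (typeof(imexp),))  # one of the three instructions
>>>>>> end
>>>>>> 
>>>>>> and then vimdiff the resulting files from the two machines.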
>>>>> I'm definitely not an expert in assembly, but that additional leaq
>>>>> instruction on line 44 and the additional movq instructions on lines
>>>>> 111, 151 and 152 really look weird.
>>>>> 
>>>>> Could you do the same test with the binary tarballs? If the
>>>>> difference persists, you should open an issue on GitHub to track this.
>>>>> 
>>>>> BTW, please wrap the first call in a function to ensure it is
>>>>> specialized for the argument types, i.e.:
>>>>> 
>>>>> f(r, x) = r * x[1,:]
>>>>> @code_native f(r, x)
>>>>> 
>>>>> Also, please check whether you still see the difference with this
>>>>> code:
>>>>> g(r, x) = r * x
>>>>> @code_native g(r, x[1,:])
>>>>> 
>>>>> What are the types of r and x? Could you provide a simple
>>>>> reproducible example with dummy values?
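>>>>> For instance, something as small as this would do as a dummy case (a
>>>>> sketch, assuming r and x are plain Float64 arrays of roughly these
>>>>> sizes):
>>>>> 
>>>>> r = rand(6000, 3)
>>>>> x = rand(500, 3)
>>>>> f(r, x) = r * x[1,:]
>>>>> @code_native f(r, x)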
>>>>>  
>>>>>> I will still test the binary tarballs. If I remove the cis() call,
>>>>>> the difference is hard to tell; the loop is ~10 times faster and
>>>>>> more or less all around 5 ms. For the whole loop with the cis()
>>>>>> call, the difference is ~50 ms on the i5 vs. ~90 ms on the i7.
>>>>>> 
>>>>>> Shall I also post the julia 0.4 code?
>>>>> If it's identical for all machines, I don't think it's needed.
>>>>>  
>>>>>  
>>>>> Regards  
>>>>>  
>>>>>  
>>>>>> cheers, Johannes  
>>>>>>   
>>>>>>   
>>>>>>   
>>>>>>> On Wednesday 30 March 2016 at 15:16 -0700, Johannes Wagner wrote:
>>>>>>>>    
>>>>>>>>    
>>>>>>>>> On Wednesday 30 March 2016 at 04:43 -0700, Johannes Wagner wrote:
>>>>>>>>>> Sorry for not having expressed myself clearly; I meant that the
>>>>>>>>>> latest version of Fedora works fine (24 development). I always
>>>>>>>>>> used the latest Julia nightly available in the copr nalimilan
>>>>>>>>>> repo. Right now that is 0.5.0-dev+3292, Commit 9d527c5*; all use
>>>>>>>>>> LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>>>>>>> 
>>>>>>>>>> peakflops() on all machines (hardware identical) is ~1.2..1.5e11.
>>>>>>>>>> 
>>>>>>>>>> Fedora 22 & 23 with Julia 0.5 is ~50% slower than 0.4; only on
>>>>>>>>>> Fedora 24 is Julia 0.5 faster compared to Julia 0.4.
>>>>>>>>> Could you try to find a simple piece of code that reproduces the
>>>>>>>>> problem? In particular, it would be useful to check whether this
>>>>>>>>> comes from OpenBLAS differences or whether it also happens with
>>>>>>>>> pure Julia code (typical operations which depend on BLAS are
>>>>>>>>> matrix multiplication, as well as most of linear algebra).
>>>>>>>>> Normally, 0.4 and 0.5 should use the same BLAS, but who knows...
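>>>>>>>>> For example, something along these lines would separate the two
>>>>>>>>> (a rough sketch): the matrix product goes through OpenBLAS, while
>>>>>>>>> the hand-written reduction is pure Julia code:
>>>>>>>>> 
>>>>>>>>> A = rand(1000, 1000); B = rand(1000, 1000)
>>>>>>>>> A * B                      # warm up
>>>>>>>>> @time A * B                # this path goes through OpenBLAS
>>>>>>>>> 
>>>>>>>>> function mysum(v)          # pure Julia, no BLAS involved
>>>>>>>>>     s = 0.0
>>>>>>>>>     for x in v
>>>>>>>>>         s += x
>>>>>>>>>     end
>>>>>>>>>     return s
>>>>>>>>> end
>>>>>>>>> v = rand(10^7)
>>>>>>>>> mysum(v)                   # warm up
>>>>>>>>> @time mysum(v)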
>>>>>>>> Well, that's what I did, and the 3 simple calls inside the loop
>>>>>>>> are more or less the same speed; only the whole loop seems slower.
>>>>>>>> See my code sample from the answer of March 8th (the code gets
>>>>>>>> faster in the same proportions when exp(im .* dotprods) is
>>>>>>>> replaced by cis(dotprods)).
>>>>>>>> So I don't know what I can do then...
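>>>>>>>> For what it's worth, the two forms are numerically equivalent; a
>>>>>>>> quick sketch to double-check that the cis() rewrite only removes
>>>>>>>> the temporary im .* dotprods array without changing the result:
>>>>>>>> 
>>>>>>>> dotprods = rand(6000)
>>>>>>>> maximum(abs(exp(im .* dotprods) - cis(dotprods)))  # ~0 up to rounding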
>>>>>>> Sorry, somehow I had missed that message. This indeed looks like a
>>>>>>> code generation issue in Julia/LLVM.
>>>>>>>   
>>>>>>>>> Can you also confirm that all versioninfo() fields are the same
>>>>>>>>> for all three machines, both for 0.4 and 0.5? We must consider
>>>>>>>>> the possibility that the differences actually come from 0.4.
>>>>>>>> Oh, right! I just noticed that my Fedora 24 machine was an Ivy
>>>>>>>> Bridge, which works fast:
>>>>>>>>    
>>>>>>>> Julia Version 0.5.0-dev+3292   
>>>>>>>> Commit 9d527c5* (2016-03-28 06:55 UTC)   
>>>>>>>> Platform Info:   
>>>>>>>>   System: Linux (x86_64-redhat-linux)   
>>>>>>>>   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz   
>>>>>>>>   WORD_SIZE: 64   
>>>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)   
>>>>>>>>   LAPACK: libopenblasp.so.0   
>>>>>>>>   LIBM: libopenlibm   
>>>>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)   
>>>>>>>>    
>>>>>>>> and the other ones with Fedora 22/23 are Haswell, which are slow:
>>>>>>>>    
>>>>>>>> Julia Version 0.5.0-dev+3292   
>>>>>>>> Commit 9d527c5* (2016-03-28 06:55 UTC)   
>>>>>>>> Platform Info:   
>>>>>>>>   System: Linux (x86_64-redhat-linux)   
>>>>>>>>   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz   
>>>>>>>>   WORD_SIZE: 64   
>>>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)   
>>>>>>>>   LAPACK: libopenblasp.so.0   
>>>>>>>>   LIBM: libopenlibm   
>>>>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)   
>>>>>>>>    
>>>>>>>> I just booted a Fedora 23 on the Ivy Bridge machine and it's also
>>>>>>>> fast.
>>>>>>>>     
>>>>>>>> Now if I use Julia 0.4.5 on both architectures:
>>>>>>>>    
>>>>>>>> Julia Version 0.4.5   
>>>>>>>> Commit 2ac304d* (2016-03-18 00:58 UTC)   
>>>>>>>> Platform Info:   
>>>>>>>>   System: Linux (x86_64-redhat-linux)   
>>>>>>>>   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz   
>>>>>>>>   WORD_SIZE: 64   
>>>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)   
>>>>>>>>   LAPACK: libopenblasp.so.0   
>>>>>>>>   LIBM: libopenlibm   
>>>>>>>>   LLVM: libLLVM-3.3   
>>>>>>>>    
>>>>>>>> and:   
>>>>>>>>    
>>>>>>>> Julia Version 0.4.5   
>>>>>>>> Commit 2ac304d* (2016-03-18 00:58 UTC)   
>>>>>>>> Platform Info:   
>>>>>>>>   System: Linux (x86_64-redhat-linux)   
>>>>>>>>   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz   
>>>>>>>>   WORD_SIZE: 64   
>>>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)   
>>>>>>>>   LAPACK: libopenblasp.so.0   
>>>>>>>>   LIBM: libopenlibm   
>>>>>>>>   LLVM: libLLVM-3.3   
>>>>>>>>    
>>>>>>>> there is no speed difference apart from the ~10% or so from the
>>>>>>>> faster Haswell machine. So could this perhaps be specific to the
>>>>>>>> Haswell hardware target, with the change from LLVM 3.3 to 3.7.1?
>>>>>>>> Is there anything else I could provide?
>>>>>>> This is certainly an interesting finding. Could you paste somewhere
>>>>>>> the output of @code_native for your function on Sandybridge vs.
>>>>>>> Haswell, for both 0.4 and 0.5?
>>>>>>>   
>>>>>>> It would also be useful to check whether the same difference
>>>>>>> appears if you use the generic binary tarballs from
>>>>>>> http://julialang.org/downloads .
>>>>>>>   
>>>>>>> Finally, do you get the same result if you remove the call to exp()
>>>>>>> from the loop? (This is the only external function, so it shouldn't
>>>>>>> be affected by changes in Julia.)
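>>>>>>> Something like this is what I mean, with the exp()/cis() call
>>>>>>> simply dropped (a sketch of the kind of loop, not your exact code):
>>>>>>> 
>>>>>>> function s_noexp(r, x)
>>>>>>>         result = zeros(size(x,1))
>>>>>>>         for i = 1:size(x,1)
>>>>>>>                 dotprods  = r * x[i,:]
>>>>>>>                 result[i] = sum(dotprods) * sum(dotprods)
>>>>>>>         end
>>>>>>>         return result
>>>>>>> end
>>>>>>> @timeit s_noexp(r, x)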
>>>>>>>   
>>>>>>>   
>>>>>>> Regards   
>>>>>>>   
>>>>>>>   
>>>>>>>> Best, Johannes   
>>>>>>>>    
>>>>>>>>>  Regards    
>>>>>>>>    
>>>>>>>>    
>>>>>>>>>> On Wednesday 16 March 2016 at 09:25 -0700, Johannes Wagner wrote:
>>>>>>>>>>> Just a little update. I tested some other Fedoras: Fedora 22
>>>>>>>>>>> with LLVM 3.8 is also slow with Julia 0.5, whereas a Fedora 24
>>>>>>>>>>> branch with LLVM 3.7 is faster on Julia 0.5 compared to Julia
>>>>>>>>>>> 0.4, as it should be (the speedup from the inner loop parts
>>>>>>>>>>> translates into a speedup of the whole function).
>>>>>>>>>>> 
>>>>>>>>>>> I don't know if anyone cares about that... At least the latest
>>>>>>>>>>> version seems to work fine; I hope it stays like this into the
>>>>>>>>>>> final Fedora 24.
>>>>>>>>>> What's the "latest version"? Git built from source or RPM
>>>>>>>>>> nightlies? With which LLVM version for each?
>>>>>>>>>> 
>>>>>>>>>> If from the RPMs, I've switched them to LLVM 3.8 for a few days,
>>>>>>>>>> and went back to 3.7 because of a build failure. So that might
>>>>>>>>>> explain the difference. You can install the last version which
>>>>>>>>>> built with LLVM 3.8 manually from here:
>>>>>>>>>> https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
>>>>>>>>>> 
>>>>>>>>>> It would be interesting to compare it with the latest nightly
>>>>>>>>>> with 3.7.
>>>>>>>>>>     
>>>>>>>>>>     
>>>>>>>>>> Regards     
>>>>>>>>>>     
>>>>>>>>>>     
>>>>>>>>>>     
>>>>>>>>>>>> hey guys,
>>>>>>>>>>>> I just experienced something weird. I have some code that runs
>>>>>>>>>>>> fine on 0.4.3; then I updated to 0.5-dev to test the new
>>>>>>>>>>>> Arrays, ran the same code, and noticed it got about ~50%
>>>>>>>>>>>> slower. Then I downgraded back to 0.4.3 and ran the old code,
>>>>>>>>>>>> but the speed remained slow. I noticed that while reinstalling
>>>>>>>>>>>> 0.4.3, openblas-threads didn't get installed along with it, so
>>>>>>>>>>>> I installed it manually, but no change.
>>>>>>>>>>>> Does anyone have an idea what could be going on? LLVM on
>>>>>>>>>>>> Fedora 23 is 3.7.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers, Johannes
>>>>>>>>>>>>      
