I was using essentially the same timing logic as you, just scaled differently. I've put my code and a transcript at: https://gist.github.com/ArchRobison/0acab7f723b1357fa2fc . I'm running on an i7-4770. I also tried a Sandy Bridge EP (Xeon E5-?), and there too gcc/clang were faster than Julia:
$ newton.clang
85.818481 nsec
$ newton.gcc
91.744494 nsec
$ julia newton.jl
114.298201 nsec

- Arch

On Wednesday, February 11, 2015 at 6:39:09 PM UTC-6, Miles Lubin wrote:
>
> Hi Arch, all,
>
> Thanks for looking into this, it's amazing to have experts here who
> understand the depths of compilers. I'm stubbornly having difficulty
> reproducing your timings, even though I see the same assembly generated for
> clang. I've tried on an i5-3320M and on an E5-2650, and on both, Julia is
> faster. How were you measuring the times in nanoseconds?
>
> Miles
>
> On Thursday, January 29, 2015 at 12:20:46 PM UTC-5, Arch Robison wrote:
>>
>> I can't replicate the 2x difference. The C is faster for me. But I
>> have gcc 4.8.2, not gcc 4.9.1. Nonetheless, the experiment points out
>> where Julia is missing a loop optimization that Clang and gcc get. Here is
>> a summary of the combinations I tried on an i7-4770 @ 3.40 GHz.
>>
>> - Julia 0.3.5: *70* nsec. Inner loop is:
>>
>> L82: vmulsd XMM3, XMM1, XMM2
>>      vmulsd XMM4, XMM1, XMM1
>>      vsubsd XMM4, XMM4, XMM0
>>      vdivsd XMM3, XMM4, XMM3
>>      vaddsd XMM1, XMM1, XMM3
>>      vmulsd XMM3, XMM1, XMM1
>>      vsubsd XMM3, XMM3, XMM0
>>      vmovq  RDX, XMM3
>>      and    RDX, RAX
>>      vmovq  XMM3, RDX
>>      vucomisd XMM3, QWORD PTR [RCX]
>>      ja L82
>>
>> - Julia trunk from around Jan 19 + LLVM 3.5: *61* nsec. Inner loop is:
>>
>> L80: vmulsd xmm4, xmm1, xmm1
>>      vsubsd xmm4, xmm4, xmm0
>>      vmulsd xmm5, xmm1, xmm2
>>      vdivsd xmm4, xmm4, xmm5
>>      vaddsd xmm1, xmm1, xmm4
>>      vmulsd xmm4, xmm1, xmm1
>>      vsubsd xmm4, xmm4, xmm0
>>      vandpd xmm4, xmm4, xmm3
>>      vucomisd xmm4, qword ptr [rax]
>>      ja L80
>>
>> The abs is done more efficiently than in Julia 0.3.5 because of PR #8364.
>> LLVM missed a CSE opportunity here because of loop rotation: the last
>> vmulsd of each iteration computes the same thing as the first vmulsd of the
>> next iteration.
>>
>> - C code compiled with gcc 4.8.2, using gcc -O2 -std=c99
>>   -march=native -mno-fma: *46* nsec
>>
>> .L11: vaddsd %xmm1, %xmm1, %xmm3
>>       vdivsd %xmm3, %xmm2, %xmm2
>>       vsubsd %xmm2, %xmm1, %xmm1
>>       vmulsd %xmm1, %xmm1, %xmm2
>>       vsubsd %xmm0, %xmm2, %xmm2
>>       vmovapd %xmm2, %xmm3
>>       vandpd %xmm5, %xmm3, %xmm3
>>       vucomisd %xmm4, %xmm3
>>       ja .L11
>>
>> Multiply by 2 (5 clock latency) has been replaced by add-to-self (3 clock
>> latency). It picked up the CSE opportunity. Only 1 vmulsd per iteration!
>>
>> - C code compiled with clang 3.5.0, using clang -O2 -march=native: *46* nsec
>>
>> .LBB1_3: # %while.body
>>          # =>This Inner Loop Header: Depth=1
>>       vmulsd %xmm3, %xmm1, %xmm5
>>       vdivsd %xmm5, %xmm2, %xmm2
>>       vaddsd %xmm2, %xmm1, %xmm1
>>       vmulsd %xmm1, %xmm1, %xmm2
>>       vsubsd %xmm0, %xmm2, %xmm2
>>       vandpd %xmm4, %xmm2, %xmm5
>>       vucomisd .LCPI1_1(%rip), %xmm5
>>       ja .LBB1_3
>>
>> Clang picks up the CSE opportunity but misses the add-to-self opportunity
>> (xmm3 = -2.0). It's also using LLVM.
>> We should check why Julia is missing the CSE opportunity. Maybe Clang
>> is running a pass that handles CSE for a rotated loop? Though looking at
>> the Julia pass list, it looks like CSE runs before loop rotation. Needs
>> more investigation.
>>
>> - Arch
