I can't replicate the 2x difference.  The C code is faster for me, but I 
have gcc 4.8.2, not gcc 4.9.1.  Nonetheless, the experiment points out 
where Julia is missing a loop optimization that Clang and gcc get.  Here is 
a summary of the combinations that I tried on an i7-4770 @ 3.40 GHz.
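The benchmarked source isn't shown in this message, but the inner-loop assembly in every listing below is consistent with a Newton iteration for sqrt(a) that loops until |x*x - a| falls below a tolerance.  A minimal C sketch of such a kernel (the function name, initial guess, and tolerance handling are my guesses, not the original code):

```c
#include <math.h>

/* Hypothetical reconstruction of the benchmarked kernel: Newton's
 * method for sqrt(a).  Each iteration does x -= (x*x - a) / (2*x),
 * and the loop exits once |x*x - a| <= tol, matching the
 * vmulsd/vsubsd/vdivsd/vandpd/vucomisd pattern in the listings. */
double newton_sqrt(double a, double tol)
{
    double x = a;                        /* initial guess */
    while (fabs(x * x - a) > tol)        /* exit test on the residual */
        x -= (x * x - a) / (2.0 * x);    /* one Newton step */
    return x;
}
```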

   - Julia 0.3.5: *70* nsec.  Inner loop is:

L82:    vmulsd  XMM3, XMM1, XMM2
        vmulsd  XMM4, XMM1, XMM1
        vsubsd  XMM4, XMM4, XMM0
        vdivsd  XMM3, XMM4, XMM3
        vaddsd  XMM1, XMM1, XMM3
        vmulsd  XMM3, XMM1, XMM1
        vsubsd  XMM3, XMM3, XMM0
        vmovq   RDX, XMM3
        and     RDX, RAX
        vmovq   XMM3, RDX
        vucomisd XMM3, QWORD PTR [RCX]
        ja      L82


   - Julia trunk from around Jan 19 + LLVM 3.5: *61* nsec.  Inner loop is:

L80:    vmulsd  xmm4, xmm1, xmm1
        vsubsd  xmm4, xmm4, xmm0
        vmulsd  xmm5, xmm1, xmm2
        vdivsd  xmm4, xmm4, xmm5
        vaddsd  xmm1, xmm1, xmm4
        vmulsd  xmm4, xmm1, xmm1
        vsubsd  xmm4, xmm4, xmm0
        vandpd  xmm4, xmm4, xmm3
        vucomisd xmm4, qword ptr [rax]
        ja      L80

 

The abs is done more efficiently than in Julia 0.3.5 because of PR #8364.  LLVM 
missed a CSE opportunity here because of loop rotation: the last vmulsd of 
each iteration computes the same value as the first vmulsd of the next 
iteration.
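In source terms, the missed CSE looks like this (a sketch, assuming the kernel is a Newton iteration for sqrt, which the assembly suggests): after rotation, the loop evaluates x*x - a at the bottom for the exit test and again at the top for the update, and the fix is to carry the residual across iterations.

```c
#include <math.h>

/* Naive rotated form: x*x - a is computed twice per iteration, once
 * for the update and once for the exit test -- the redundancy LLVM
 * fails to eliminate in the Julia-generated code. */
double newton_naive(double a, double tol)
{
    double x = a;
    while (fabs(x * x - a) > tol)
        x -= (x * x - a) / (2.0 * x);
    return x;
}

/* CSE'd form (what gcc and Clang effectively emit): the residual
 * r = x*x - a is computed once per iteration and feeds both the
 * update and the exit test. */
double newton_cse(double a, double tol)
{
    double x = a;
    double r = x * x - a;
    while (fabs(r) > tol) {
        x -= r / (2.0 * x);
        r = x * x - a;      /* shared by next update and exit test */
    }
    return x;
}
```

Both forms perform the same arithmetic in the same order, so the results are bit-identical; only the number of multiplies per iteration changes.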


   - C code compiled with gcc 4.8.2, using gcc -O2 -std=c99 -march=native 
   -mno-fma: *46* nsec.  Inner loop is:

.L11:
        vaddsd  %xmm1, %xmm1, %xmm3
        vdivsd  %xmm3, %xmm2, %xmm2
        vsubsd  %xmm2, %xmm1, %xmm1
        vmulsd  %xmm1, %xmm1, %xmm2
        vsubsd  %xmm0, %xmm2, %xmm2
        vmovapd %xmm2, %xmm3
        vandpd  %xmm5, %xmm3, %xmm3
        vucomisd        %xmm4, %xmm3
        ja      .L11


The multiply by 2 (5-cycle latency) has been replaced by add-to-self (3-cycle 
latency), and gcc picked up the CSE opportunity.  Only one vmulsd per iteration!
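The strength reduction is safe because, for IEEE-754 doubles, multiply-by-2 and add-to-self are both exact operations and always agree bit-for-bit, including for denormals and overflow to infinity.  An illustrative check:

```c
/* For IEEE-754 doubles, 2.0 * x and x + x are both exact, so they
 * always produce identical results -- which is why gcc is free to
 * substitute the 3-cycle vaddsd for the 5-cycle vmulsd. */
static double twice_mul(double x) { return 2.0 * x; }  /* vmulsd */
static double twice_add(double x) { return x + x; }    /* vaddsd */
```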


   - C code compiled with clang 3.5.0, using clang -O2 -march=native: *46* 
   nsec.  Inner loop is:

.LBB1_3:                                # %while.body
                                        # =>This Inner Loop Header: Depth=1
        vmulsd  %xmm3, %xmm1, %xmm5
        vdivsd  %xmm5, %xmm2, %xmm2
        vaddsd  %xmm2, %xmm1, %xmm1
        vmulsd  %xmm1, %xmm1, %xmm2
        vsubsd  %xmm0, %xmm2, %xmm2
        vandpd  %xmm4, %xmm2, %xmm5
        vucomisd        .LCPI1_1(%rip), %xmm5
        ja      .LBB1_3


Clang picks up the CSE opportunity but misses the add-to-self opportunity 
(xmm3 = -2.0).  Since Clang also uses LLVM, we should check why Julia is 
missing the CSE opportunity.  Maybe Clang runs a pass that handles CSE for a 
rotated loop?  Though looking at the Julia pass list, CSE appears to run 
before loop rotation.  This needs more investigation.


- Arch 
 
