I can't replicate the 2x difference. The C is faster for me. But I
have gcc 4.8.2, not gcc 4.9.1. Nonetheless, the experiment points out
where Julia is missing a loop optimization that Clang and gcc get. Here is
a summary of combinations that I tried on an i7-4770 @ 3.40 GHz.
- Julia 0.3.5: *70* nsec. Inner loop is:
L82: vmulsd XMM3, XMM1, XMM2
vmulsd XMM4, XMM1, XMM1
vsubsd XMM4, XMM4, XMM0
vdivsd XMM3, XMM4, XMM3
vaddsd XMM1, XMM1, XMM3
vmulsd XMM3, XMM1, XMM1
vsubsd XMM3, XMM3, XMM0
vmovq RDX, XMM3
and RDX, RAX
vmovq XMM3, RDX
vucomisd XMM3, QWORD PTR [RCX]
ja L82
- Julia trunk from around Jan 19 + LLVM 3.5: *61* nsec. Inner loop is:
L80: vmulsd xmm4, xmm1, xmm1
vsubsd xmm4, xmm4, xmm0
vmulsd xmm5, xmm1, xmm2
vdivsd xmm4, xmm4, xmm5
vaddsd xmm1, xmm1, xmm4
vmulsd xmm4, xmm1, xmm1
vsubsd xmm4, xmm4, xmm0
vandpd xmm4, xmm4, xmm3
vucomisd xmm4, qword ptr [rax]
ja L80
The abs is done more efficiently than for Julia 0.3.5 because of PR #8364. LLVM
missed a CSE opportunity here because of loop rotation: the last vmulsd of
each iteration computes the same thing as the first vmulsd of the next
iteration.
- C code compiled with gcc 4.8.2, using gcc -O2 -std=c99 -march=native
-mno-fma: *46* nsec
.L11:
vaddsd %xmm1, %xmm1, %xmm3
vdivsd %xmm3, %xmm2, %xmm2
vsubsd %xmm2, %xmm1, %xmm1
vmulsd %xmm1, %xmm1, %xmm2
vsubsd %xmm0, %xmm2, %xmm2
vmovapd %xmm2, %xmm3
vandpd %xmm5, %xmm3, %xmm3
vucomisd %xmm4, %xmm3
ja .L11
gcc replaced the multiply by 2 (5 clock latency) with an add-to-self (3 clock
latency). It also picked up the CSE opportunity: only 1 vmulsd per iteration!
- C code compiled with clang 3.5.0, using clang -O2 -march=native: *46* nsec
.LBB1_3: # %while.body
# =>This Inner Loop Header: Depth=1
vmulsd %xmm3, %xmm1, %xmm5
vdivsd %xmm5, %xmm2, %xmm2
vaddsd %xmm2, %xmm1, %xmm1
vmulsd %xmm1, %xmm1, %xmm2
vsubsd %xmm0, %xmm2, %xmm2
vandpd %xmm4, %xmm2, %xmm5
vucomisd .LCPI1_1(%rip), %xmm5
ja .LBB1_3
Clang picks up the CSE opportunity but misses the add-to-self opportunity
(it multiplies by xmm3=-2.0 instead). Clang is also using LLVM, so the
backend itself can do this CSE.
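For reference, here is a source-level sketch of the two optimizations discussed in the list above. The loop is my reconstruction from the assembly (a Newton iteration for a square root); the original benchmark source is not shown here, so the function names, the starting guess `x = a`, and the `tol` parameter are assumptions for illustration only.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical reconstruction of the benchmark loop, inferred from the
   assembly listings; not the original source. */

/* Straightforward form: x*x - a is computed in the loop test and then
   again in the update, like the two squarings in the Julia loops. */
static double newton_sqrt_naive(double a, double tol)
{
    assert(a > 0.0 && tol > 0.0);
    double x = a;                       /* assumed starting guess */
    while (fabs(x * x - a) > tol)
        x = x - (x * x - a) / (2.0 * x);
    return x;
}

/* Hand-optimized form: the residual r computed for the loop test is
   carried into the next update (the CSE), and 2*x is written as x + x
   (add-to-self), roughly what the gcc loop does. */
static double newton_sqrt_carried(double a, double tol)
{
    assert(a > 0.0 && tol > 0.0);
    double x = a;                       /* assumed starting guess */
    double r = x * x - a;
    while (fabs(r) > tol) {
        x = x - r / (x + x);
        r = x * x - a;                  /* the only squaring per iteration */
    }
    return x;
}
```

Both forms should produce the same iterates, since `x + x` and `2.0 * x` are exactly equal in IEEE arithmetic. Note that `tol` has to be large enough for the residual to actually reach it (for example `1e-9` with `a = 2.0`), otherwise either loop may not terminate.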
We should check why Julia is missing the CSE opportunity. Maybe Clang is
running a pass that handles CSE for a rotated loop? Though looking at the
Julia pass list, it looks like CSE runs before loop rotation. Needs more
investigation.
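As a side note on the abs difference between the two Julia loops: the vmovq / and / vmovq sequence in 0.3.5 and the single vandpd on trunk both appear to clear the sign bit of the double. A minimal C sketch of that operation follows, assuming the mask register holds the usual sign-clearing constant (the listings do not show its value); `abs_by_mask` is a hypothetical name.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Absolute value of a double by clearing the IEEE 754 sign bit.
   The memcpy calls correspond to moving the value between a floating
   point register and an integer register, as the 0.3.5 loop does. */
static double abs_by_mask(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);         /* expects a 64-bit double */
    memcpy(&bits, &x, sizeof bits);          /* double -> integer bits */
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);    /* assumed mask: clear bit 63 */
    memcpy(&x, &bits, sizeof x);             /* integer bits -> double */
    return x;
}
```

Doing the mask directly in the vector register (vandpd) avoids the two register-file crossings, which is consistent with the shorter trunk loop.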
- Arch