I was using essentially the same timing logic as you, just scaled differently. I've put my code and a transcript at: https://gist.github.com/ArchRobison/0acab7f723b1357fa2fc . I'm running on an i7-4770. I also tried a Sandy Bridge EP (Xeon E5-?), and there too gcc/clang were faster than Julia:
$ newton.clang
85.818481 nsec
$ newton.gcc
91.744494 nsec
$ julia newton.jl
114.298201 nsec

- Arch

On Wednesday, February 11, 2015 at 6:39:09 PM UTC-6, Miles Lubin wrote:
>
> Hi Arch, all,
>
> Thanks for looking into this, it's amazing to have experts here who
> understand the depths of compilers. I'm stubbornly having difficulty
> reproducing your timings, even though I see the same assembly generated for
> clang. I've tried on an i5-3320M and on an E5-2650, and on both, Julia is
> faster. How were you measuring the times in nanoseconds?
>
> Miles
>
> On Thursday, January 29, 2015 at 12:20:46 PM UTC-5, Arch Robison wrote:
>>
>> I can't replicate the 2x difference. The C is faster for me. But I
>> have gcc 4.8.2, not gcc 4.9.1. Nonetheless, the experiment points out
>> where Julia is missing a loop optimization that Clang and gcc get. Here is
>> a summary of the combinations I tried on an i7-4770 @ 3.40 GHz.
>>
>> - Julia 0.3.5: *70* nsec. Inner loop is:
>>
>> L82: vmulsd XMM3, XMM1, XMM2
>>      vmulsd XMM4, XMM1, XMM1
>>      vsubsd XMM4, XMM4, XMM0
>>      vdivsd XMM3, XMM4, XMM3
>>      vaddsd XMM1, XMM1, XMM3
>>      vmulsd XMM3, XMM1, XMM1
>>      vsubsd XMM3, XMM3, XMM0
>>      vmovq  RDX, XMM3
>>      and    RDX, RAX
>>      vmovq  XMM3, RDX
>>      vucomisd XMM3, QWORD PTR [RCX]
>>      ja L82
>>
>> - Julia trunk from around Jan 19 + LLVM 3.5: *61* nsec. Inner loop is:
>>
>> L80: vmulsd xmm4, xmm1, xmm1
>>      vsubsd xmm4, xmm4, xmm0
>>      vmulsd xmm5, xmm1, xmm2
>>      vdivsd xmm4, xmm4, xmm5
>>      vaddsd xmm1, xmm1, xmm4
>>      vmulsd xmm4, xmm1, xmm1
>>      vsubsd xmm4, xmm4, xmm0
>>      vandpd xmm4, xmm4, xmm3
>>      vucomisd xmm4, qword ptr [rax]
>>      ja L80
>>
>> The abs is done more efficiently than in Julia 0.3.5 because of PR #8364.
>> LLVM missed a CSE opportunity here because of loop rotation: the last
>> vmulsd of each iteration computes the same thing as the first vmulsd of the
>> next iteration.
>>
>> - C code compiled with gcc 4.8.2, using gcc -O2 -std=c99
>>   -march=native -mno-fma: *46* nsec
>>
>> .L11: vaddsd %xmm1, %xmm1, %xmm3
>>       vdivsd %xmm3, %xmm2, %xmm2
>>       vsubsd %xmm2, %xmm1, %xmm1
>>       vmulsd %xmm1, %xmm1, %xmm2
>>       vsubsd %xmm0, %xmm2, %xmm2
>>       vmovapd %xmm2, %xmm3
>>       vandpd %xmm5, %xmm3, %xmm3
>>       vucomisd %xmm4, %xmm3
>>       ja .L11
>>
>> Multiply by 2 (5 clock latency) has been replaced by add-to-self (3 clock
>> latency). It picked up the CSE opportunity. Only 1 vmulsd per iteration!
>>
>> - C code compiled with clang 3.5.0, using clang -O2 -march=native: *46* nsec
>>
>> .LBB1_3: # %while.body
>>          # =>This Inner Loop Header: Depth=1
>>       vmulsd %xmm3, %xmm1, %xmm5
>>       vdivsd %xmm5, %xmm2, %xmm2
>>       vaddsd %xmm2, %xmm1, %xmm1
>>       vmulsd %xmm1, %xmm1, %xmm2
>>       vsubsd %xmm0, %xmm2, %xmm2
>>       vandpd %xmm4, %xmm2, %xmm5
>>       vucomisd .LCPI1_1(%rip), %xmm5
>>       ja .LBB1_3
>>
>> Clang picks up the CSE opportunity but misses the add-to-self opportunity
>> (xmm3 = -2.0). It's also using LLVM.
>> We should check why Julia is missing the CSE opportunity. Maybe Clang
>> is running a pass that handles CSE for a rotated loop? Though looking at
>> the Julia pass list, it looks like CSE runs before loop rotation. Needs
>> more investigation.
>>
>> - Arch
