Here's the Julia assembly:

function squareroot(x)
    it = x
    err = it*it-x # line 5
    while abs(err) > 1e-13
        it = it - (err)/(2it) # line 7
        err = it*it-x
    end
    return it
end
Source line: 5
 push RBP
 mov RBP, RSP
Source line: 5
 vmulsd XMM1, XMM0, XMM0
 vsubsd XMM2, XMM1, XMM0
Source line: 6
 vmovq RCX, XMM2
 movabs RAX, 9223372036854775807
 and RCX, RAX
 vmovq XMM1, RCX
 movabs RCX, 140067103257920
 vucomisd XMM1, QWORD PTR [RCX]
 ja 9
 vmovaps XMM1, XMM0
 jmpq 61
 movabs RDX, 140067103257928
 vmovsd XMM3, QWORD PTR [RDX]
 vmovaps XMM1, XMM0
Source line: 7
 vmulsd XMM4, XMM1, XMM3
 vdivsd XMM2, XMM2, XMM4
 vaddsd XMM1, XMM1, XMM2
Source line: 8
 vmulsd XMM2, XMM1, XMM1
 vsubsd XMM2, XMM2, XMM0
 vmovq RDX, XMM2
 and RDX, RAX
 vmovq XMM4, RDX
 vucomisd XMM4, QWORD PTR [RCX]
 ja -43
Source line: 10
 vmovaps XMM0, XMM1
 pop RBP
 ret

And for C:
double squareroot(double x)
{
    double it = x;
    double err = it*it-x;
    while (fabs(err) > 1e-13) {
        it = it - (err)/(2*it);
        err = it*it-x;
    }
    return it;
}
00000000004007d0 <squareroot>:
  4007d0: c5 fb 59 c8           vmulsd %xmm0,%xmm0,%xmm1
  4007d4: c5 f9 28 d0           vmovapd %xmm0,%xmm2
  4007d8: c5 fb 10 2d e0 01 00 vmovsd 0x1e0(%rip),%xmm5        # 4009c0 
<_IO_stdin_used+0x70>
  4007df: 00 
  4007e0: c5 fb 10 25 a8 01 00 vmovsd 0x1a8(%rip),%xmm4        # 400990 
<_IO_stdin_used+0x40>
  4007e7: 00 
  4007e8: c5 f3 5c c8           vsubsd %xmm0,%xmm1,%xmm1
  4007ec: c5 f9 28 d9           vmovapd %xmm1,%xmm3
  4007f0: c5 e1 54 dd           vandpd %xmm5,%xmm3,%xmm3
  4007f4: c5 f9 2e dc           vucomisd %xmm4,%xmm3
  4007f8: 76 28                 jbe    400822 <squareroot+0x52>
  4007fa: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
  400800: c5 eb 58 da           vaddsd %xmm2,%xmm2,%xmm3
  400804: c5 f3 5e cb           vdivsd %xmm3,%xmm1,%xmm1
  400808: c5 eb 5c d1           vsubsd %xmm1,%xmm2,%xmm2
  40080c: c5 eb 59 ca           vmulsd %xmm2,%xmm2,%xmm1
  400810: c5 f3 5c c8           vsubsd %xmm0,%xmm1,%xmm1
  400814: c5 f9 28 d9           vmovapd %xmm1,%xmm3
  400818: c5 e1 54 dd           vandpd %xmm5,%xmm3,%xmm3
  40081c: c5 f9 2e dc           vucomisd %xmm4,%xmm3
  400820: 77 de                 ja     400800 <squareroot+0x30>
  400822: c5 f9 28 c2           vmovapd %xmm2,%xmm0
  400826: c3                   retq   
  400827: 66 0f 1f 84 00 00 00 nopw   0x0(%rax,%rax,1)
  40082e: 00 00

I'm definitely not trying to making the fastest possible sqrt 
implementation. This is just for benchmarking purposes.

Thanks,
Miles

On Tuesday, January 27, 2015 at 7:01:29 PM UTC-5, Erik Schnetter wrote:
>
> I would begin by looking at the generated assembler code. You would use 
> "objdump -d" for C, and the "code_native" function for Julia. Can you post 
> the results? 
>
> The discussion below is beside your points, but nevertheless: 
>
> Your sqrt implementation is very slow for two reasons: 
>
> (1) You are using a while loop to check for termination. If you use 
> exponent manipulation, then your starting guess can be within sqrt(2) of 
> the result, and then you can determine the necessary number of iterations 
> ahead of time. You probably need 5 or 6 iterations for the accuracy you 
> chose. 
>
> (2) You are using a division in your algorithm, where divisions are 
> expensive. If you calculated 1/sqrt(x) instead of sqrt(x), then you don't 
> need a division. Later you can calculate sqrt(x) = x * (1/sqrt(x)). 
>
> A good sqrt routine should take less than 10^-8 seconds. 
>
> -erik 
>

Reply via email to