I think I have it: it's the 64 bit registers. I got ldc to work
in 32 bit (I didn't have that yesterday, so I was only testing
64 bit) and compiled the test program with it.

No difference in timing between ldc 32 bit and dmd 32 bit.
The disassembly isn't identical but the time is. (The disassembly
seems to mainly order things differently, but ldc has fewer jump
instructions too.)

Anyway.

In 64 bit, ldc gets a speedup over dmd. Looking at the asm
output, it looks like dmd doesn't use any of the new registers
(R8 through R15), whereas ldc does. (dmd's 64 bit output looks
mostly like 32 bit code with R registers instead of E ones.)


Here's the program. It's based on one of the Python ones.

====
import std.bigint;
import std.stdio;

alias BigInt number;

void main() {
        auto N = 10000;

        number i, k, ns;
        number k1 = 1;
        number n,a,d,t,u;
        n = 1;
        d = 1;
        while(1) {
                k += 1;
                t = n<<1;
                n *= k;
                a += t;
                k1 += 2;
                a *= k1;
                d *= k1;
                if(a >= n) {
                        t = (n*3 +a)/d;
                        u = (n*3 +a)%d;
                        u += n;
                        if(d > u) {
                                ns = ns*10 + t;
                                i += 1;
                                if(i % 10 == 0) {
                                        debug writefln ("%010d\t:%d", ns, i);
                                        ns = 0;
                                }
                                if(i >= N) {
                                        break;
                                }
                                a -= d*t;
                                a *= 10;
                                n *= 10;
                        }
                }
        }
}
=====

BigInt's calls aren't inlined, but that's a frontend issue. Let's
eliminate that by switching the alias from BigInt to long.

The result will be wrong, since the values overflow a long, but
that's beside the point for now. I just want to see integer math.
(This is why the writefln is debug too.)
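
For reference, the swap is a one-line change to the alias. Here's a
minimal standalone sketch of it (this is not the benchmark itself;
the little overflow demo in main is just my own illustration). It
shows why the digits come out wrong: long arithmetic wraps around
silently instead of growing the way BigInt does.

====
import std.stdio;

alias long number;      // was: alias BigInt number;

void main() {
        number n = long.max;
        n += 1;         // overflows and wraps around to long.min
        writeln(n == long.min);
}
====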

With optimizations turned on, ldc again wins by the same ratio -
it runs in about 2/3 the time - and the code is much easier to look
at.


Let's see what's going on.


The relevant loop from DMD (64 bit):

===
L47:            inc     qword ptr -040h[RBP]
                mov     RAX,-028h[RBP]
                add     RAX,RAX
                mov     -010h[RBP],RAX
                mov     RAX,-040h[RBP]
                imul    RAX,-028h[RBP]
                mov     -028h[RBP],RAX
                mov     RAX,-010h[RBP]
                add     -020h[RBP],RAX
                add     qword ptr -030h[RBP],2
                mov     RAX,-030h[RBP]
                imul    RAX,-020h[RBP]
                mov     -020h[RBP],RAX
                mov     RAX,-030h[RBP]
                imul    RAX,-018h[RBP]
                mov     -018h[RBP],RAX
                mov     RAX,-020h[RBP]
                cmp     RAX,-028h[RBP]
                jl      L47
                mov     RAX,-028h[RBP]
                lea     RAX,[RAX*2][RAX]
                add     RAX,-020h[RBP]
                mov     -058h[RBP],RAX
                cqo
                idiv    qword ptr -018h[RBP]
                mov     -010h[RBP],RAX
                mov     RAX,-058h[RBP]
                cqo
                idiv    qword ptr -018h[RBP]
                mov     -8[RBP],RDX
                mov     RAX,-028h[RBP]
                add     -8[RBP],RAX
                mov     RAX,-018h[RBP]
                cmp     RAX,-8[RBP]
                jle     L47
                mov     RAX,-038h[RBP]
                lea     RAX,[RAX*4][RAX]
                add     RAX,RAX
                add     RAX,-010h[RBP]
                mov     -038h[RBP],RAX
                inc     qword ptr -048h[RBP]
                mov     RAX,-048h[RBP]
                mov     RCX,0Ah
                cqo
                idiv    RCX
                test    RDX,RDX
                jne     L109
                mov     qword ptr -038h[RBP],0
L109:           cmp     qword ptr -048h[RBP],02710h
                jge     L137
                mov     RAX,-018h[RBP]
                imul    RAX,-010h[RBP]
                sub     -020h[RBP],RAX
                imul    EAX,-020h[RBP],0Ah
                mov     -020h[RBP],RAX
                imul    EAX,-028h[RBP],0Ah
                mov     -028h[RBP],RAX
                jmp       L47
===


and from ldc 64 bit:

====
L20:            add     RDI,2
                inc     RCX
                lea     R9,[R10*2][R9]
                imul    R9,RDI
                imul    R8,RDI
                imul    R10,RCX
                cmp     R9,R10
                jl      L20
                lea     RAX,[R10*2][R10]
                add     RAX,R9
                cqo
                idiv    R8
                add     RDX,R10
                cmp     R8,RDX
                jle     L20
                cmp     RSI,0270Fh
                jg      L73
                imul    RAX,R8
                sub     R9,RAX
                add     R9,R9
                lea     R9,[R9*4][R9]
                inc     RSI
                add     R10,R10
                lea     R10,[R10*4][R10]
                jmp short       L20
===


The first thing that pops out is that the ldc code is a lot shorter.
The second is that ldc makes better use of the registers: in this
loop it keeps every variable in one, while dmd keeps them all in
stack slots (the [RBP] operands) and shuffles them through RAX.
Indeed, the shortness looks to be thanks to the registers
eliminating a lot of movs.



So I'm pretty sure the difference is caused by dmd not using the
new registers in x64. The other differences look trivial to my
eyes.
