In this post I show a few things I have found or collected in the last few weeks 
related to the performance of code compiled with DMD.

I have added two of them (as tiny benchmarks) to the list of slow things I keep:
http://www.fantascienza.net/leonardo/js/slow_d.zip

------------------------

This is D code that just performs many integer divisions by 7, a divisor known 
at compile time:

import std.stdio: printf;

int divideBySeven(int x) {
    return x / 7;
}

void main() {
    int i = int.max;
    int r;
    while (i--)
        r = divideBySeven(i);
    printf("%d\n", r);
}


The same code in C:

#include <limits.h>
#include <stdio.h>

int divideBySeven(int x) {
    return x / 7;
}

int main() {
    int i = INT_MAX;
    int r;
    while (i--)
        r = divideBySeven(i);

    printf("%d\n", r);
    return 0;
}


I have compiled the C code with MinGW (based on GCC 4.2.1) using -O3 -s, and the 
D code with DMD 1.035 using -O -release -inline; the CPU is a 2 GHz Core 2.

Timing results:
intdiv:
  C:  5.04 s
  D: 39.32 s


The ASM generated by DMD for the function divideBySeven():

mov     ECX,7
cdq
idiv    ECX


The ASM generated by GCC for the function divideBySeven():

pushl   %ebp
movl    %esp, %ebp
movl    8(%ebp), %ecx
pushl   %ebx
movl    $-1840700269, %ebx
movl    %ebx, %eax
popl    %ebx
imull   %ecx
popl    %ebp
leal    (%edx,%ecx), %eax
sarl    $2, %eax
sarl    $31, %ecx
subl    %ecx, %eax
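
The GCC version avoids the slow idiv: it multiplies by a precomputed "magic" 
reciprocal (-1840700269, i.e. 0x92492493) and keeps the high half of the 64-bit 
product. A minimal C sketch of the same trick follows; divideBySevenFast is a 
hypothetical name, and main checks it against plain division:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Division by 7 via multiplication by a "magic" constant, mirroring the
   GCC asm above: multiply-high (imull), add x (leal), two arithmetic
   shifts, then a correction for negative x. */
int divideBySevenFast(int x) {
    int64_t product = (int64_t)x * -1840700269; /* full 64-bit signed product */
    int quotient = (int)(product >> 32) + x;    /* high 32 bits, plus x */
    quotient >>= 2;                             /* sarl $2 */
    quotient -= x >> 31;                        /* round toward zero for x < 0 */
    return quotient;
}

int main(void) {
    for (int x = -1000; x <= 1000; x++)
        assert(divideBySevenFast(x) == x / 7);
    printf("%d %d\n", divideBySevenFast(100), divideBySevenFast(-100));
    /* prints "14 -14" */
    return 0;
}

The multiply and shifts are all cheap single-cycle-class operations, while a 
32-bit idiv costs dozens of cycles, which is enough to explain the 5.04 s vs 
39.32 s gap above.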

--------------------------

This is a scanning benchmark, in C:

#include <stdio.h>

int main() {
    int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
    int i = 300000000;
    while (i--) {
        // 0.63 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }

    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}


The equivalent D code:

import std.stdio: printf;

int main() {
    int counter0, counter1, counter2, counter3;

    int i = 300_000_000;
    while (i--) { // 1.23 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }

    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}


Timings:
Scan (i = 300_000_000):
  C: 0.63 s
  D: 1.23 s
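
For non-negative i the test i % 4 == k only inspects the two lowest bits, so 
the whole if/else chain can be collapsed into indexing an array of counters 
with i & 3. A branch-free C sketch (countResidues is a hypothetical helper 
name; a good optimizer gets close to this form on its own):

#include <stdio.h>

/* Count how many of 0 .. n-1 fall into each residue class mod 4.
   For non-negative i, i % 4 equals i & 3, so no branches are needed. */
void countResidues(int n, int counters[4]) {
    counters[0] = counters[1] = counters[2] = counters[3] = 0;
    while (n--)
        counters[n & 3]++;
}

int main(void) {
    int counters[4];
    countResidues(300000000, counters);
    printf("%d %d %d %d\n",
           counters[0], counters[1], counters[2], counters[3]);
    /* prints "75000000 75000000 75000000 75000000" */
    return 0;
}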

I can post the generated asm on request.

------------------------

The D1 docs strongly suggest using foreach whenever possible, avoiding the less 
handy for(), so for almost a year I have assumed foreach is as fast as for(), 
but these two versions show a difference:

20.66 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=1

My classless version, 25.91 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=2

That second version uses the following code in the most critical spot of the 
program:

void advance(TyPlanets bodies, double dt) {
    foreach(idx, ref b; bodies) {
        double bm = b.mass;
        foreach(ref b2; bodies[idx + 1 .. length]) {

Removing the foreach loops and replacing them with for loops like the following 
gives a significant performance increase: the code runs in about 19.5 seconds 
(estimated timing on the Shootout computer; I can give you exact timings on my 
PC if you want):

static void advance(TyPlanets bodies, double dt) {
    for (int idx; idx < bodies.length; idx++) {
        auto b = &(bodies[idx]);
        double bm = b.mass;
        for (int j = idx + 1; j < bodies.length; j++) {
            auto b2 = &(bodies[j]);
            
(Note that I haven't just replaced the foreach with a for: I have also removed 
the slicing bodies[idx+1..$], which slows down the code a little too. But I 
have seen that even removing just the slice, the foreach version is still 
slower.)

------------------------

Looking at the assembly produced by the Intel compiler from C code that 
contains many floating point operations, I can see how short and slick it is. 
I have also read a few articles that show various ways to manage the floating 
point stack as efficiently as possible. Some of those methods are slow or 
quite slow at compile time, but they produce more efficient code.
Using those refined methods for all the FP operations of a large program may 
lead to overly long compilation times. So profile-guided optimization may be 
used to find the hot parts of the code, optimize the FP operations in them 
especially well, and compile all the other FP parts with a faster compilation 
method.
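
As a concrete sketch of that workflow with GCC (prog.c here is just a 
throwaway test program I made up for illustration):

# Create a tiny test program (any program with a hot loop works)
cat > prog.c <<'EOF'
#include <stdio.h>
int main(void) {
    long s = 0;
    for (long i = 0; i < 100000000; i++) s += i % 7;
    printf("%ld\n", s);
    return 0;
}
EOF
# 1. Build with instrumentation
gcc -O2 -fprofile-generate prog.c -o prog
# 2. Run on a representative workload (writes .gcda profile data)
./prog
# 3. Rebuild using the collected profile; GCC optimizes the measured
#    hot paths harder
gcc -O2 -fprofile-use prog.c -o prog
./prog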

GCC also has a "hot" function attribute that lets programmers manually mark 
which functions are most critical:
http://gcc.gnu.org/onlinedocs/gcc-4.3.0//gcc/Function-Attributes.html
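
A minimal usage sketch (the dot3 function and its body are just an example I 
made up; GCC 4.3+ optimizes hot-marked functions more aggressively and groups 
them together for better code locality):

#include <stdio.h>

/* Mark this function as a hot spot so GCC optimizes it more aggressively
   and places it near other hot code. */
__attribute__((hot))
double dot3(const double *a, const double *b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

int main(void) {
    double a[3] = {1.0, 2.0, 3.0};
    double b[3] = {4.0, 5.0, 6.0};
    printf("%g\n", dot3(a, b));   /* 1*4 + 2*5 + 3*6, prints "32" */
    return 0;
}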

In D, in theory, something like this could be used:
optimize { ... }
(Like static if { ... }, it doesn't create a new scope.)

It tells the compiler that a piece of code that uses lots of FP operations is 
speed critical, so the compiler can compile it with a slower but better 
floating point stack allocation algorithm, like the ones I have read about.

Bye,
bearophile
