Re: Two optimizations

davidl Mon, 27 Jul 2009 19:45:13 -0700

On Tue, 28 Jul 2009 04:51:42 +0800, bearophile <[email protected]> wrote:

High-level optimizations have to be done mostly by the front-end because the back-end usually doesn't know about the high-level constructs of D.
So this is the right place to ask for some of these optimizations, while asking for them on the LLVM or LDC IRC channels is not the best thing to do (even if LDC developers usually try to do what I ask them, they are very kind).
---------------------
Recently I posted here a link to some of the optimizations done by the Scala compiler (ones that the JavaVM back-end isn't able to perform), but that post was ignored. The most important thing it does is inline most closures of a Scala program. Such an optimization is important for functional-style programming because, like tail-call optimization, when present it lets the programmer use certain idioms more freely (passing delegates to higher-order functions, etc.), essentially allowing a different programming style. This optimization probably isn't easy to implement; it requires some code. Is anyone interested in adding this to D?
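To make the closure point concrete, here is a minimal sketch in present-day D2 (an assumption; the code in this thread is D1/Tango-era). The names `sumMapped` and `sumMappedAlias` are hypothetical. The first version receives the closure as a run-time delegate, which is the case a closure-inlining pass would have to handle; the second passes the function as a template alias parameter, so the callee is known at compile time and inlining it is left to the ordinary inliner.

```d
// Closure passed as a run-time delegate: each call to f goes through
// an indirect call unless the compiler can prove which closure it is.
int sumMapped(int[] xs, int delegate(int) f) {
    int total = 0;
    foreach (x; xs)
        total += f(x);
    return total;
}

// Same loop, but the function is a template alias parameter: every
// instantiation sees the callee directly at compile time.
int sumMappedAlias(alias f)(int[] xs) {
    int total = 0;
    foreach (x; xs)
        total += f(x);
    return total;
}
```

A call such as `sumMapped(data, (int x) => x * k)` with a local `k` builds a real closure, while `sumMappedAlias!(x => x * 3)(data)` does not need one.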
---------------------
This comes from successive simplification of some code of mine. It's a loop with a kernel over an array, a common operation in my kind of code:
version(Tango) import tango.stdc.stdio: printf;
void main() {
    auto a = new int[30];
    for (int x = 2; x < a.length-2; x++)
        foreach (s; [-2, -1, 0, 1, 2])
            a[x] += a[x + s];
    printf("%d\n", a[5]);
}

This is the asm produced by LDC of the inner foreach loop:

.LBB1_1:
        movl    $4294967294, 132(%esp)
        movl    $4294967295, 136(%esp)
        movl    $0, 140(%esp)
        movl    $1, 144(%esp)
        movl    $2, 148(%esp)
        movl    20(%esp,%eax,4), %ecx
        addl    12(%esp,%eax,4), %ecx
        movl    %ecx, 20(%esp,%eax,4)
        addl    16(%esp,%eax,4), %ecx
        movl    %ecx, 20(%esp,%eax,4)
        addl    %ecx, %ecx
        movl    %ecx, 20(%esp,%eax,4)
        addl    24(%esp,%eax,4), %ecx
        movl    %ecx, 20(%esp,%eax,4)
        addl    28(%esp,%eax,4), %ecx
        movl    %ecx, 20(%esp,%eax,4)
        incl    %eax
        cmpl    $26, %eax
The [-2, -1, 0, 1, 2] array is immutable (and the variable 's' of the foreach isn't by ref), so can't the following initializations of that array be moved out of the inner loop?
        movl    $4294967294, 132(%esp)
        movl    $4294967295, 136(%esp)
        movl    $0, 140(%esp)
        movl    $1, 144(%esp)
        movl    $2, 148(%esp)
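Until a compiler does this hoisting by itself, it can be done by hand. The sketch below assumes present-day D2; `offsets`, `kernelLiteral` and `kernelHoisted` are illustrative names, not anything from LDC. The first function keeps the array literal inside the loop as in the code above; the second reads the same values from a table in static storage, so there is nothing to re-initialize on each outer iteration whatever the optimizer decides.

```d
// The kernel offsets, stored once in static immutable data.
static immutable int[5] offsets = [-2, -1, 0, 1, 2];

// Original shape: the array literal appears inside the outer loop.
int[] kernelLiteral(int[] a) {
    for (int x = 2; x < cast(int) a.length - 2; x++)
        foreach (s; [-2, -1, 0, 1, 2])
            a[x] += a[x + s];
    return a;
}

// Hand-hoisted shape: the loop only reads the static table.
int[] kernelHoisted(int[] a) {
    for (int x = 2; x < cast(int) a.length - 2; x++)
        foreach (s; offsets)
            a[x] += a[x + s];
    return a;
}
```

Both functions update the array in place and return it, so they can be compared directly on copies of the same input.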


This is a variant of the same code:

version(Tango) import tango.stdc.stdio: printf;
template Tuple(T...) { alias T Tuple; }
alias Tuple!(-2, -1, 0, 1, 2) move;
void main() {
    auto a = new int[30];
    for (int x = 2; x < a.length-2; x++)
        foreach (s; move)
            a[x] += a[x + s];
    printf("%d\n", a[5]);
}

The corresponding asm is much better:

main:
        pushl   %edi
        subl    $128, %esp
        xorl    %eax, %eax
        movl    $30, %ecx
        leal    8(%esp), %edi
        rep;stosl
        xorl    %eax, %eax
        .align  16
.LBB1_1:
        movl    16(%esp,%eax,4), %ecx
        addl    8(%esp,%eax,4), %ecx
        movl    %ecx, 16(%esp,%eax,4)
        addl    12(%esp,%eax,4), %ecx
        addl    %ecx, %ecx
        movl    %ecx, 16(%esp,%eax,4)
        addl    20(%esp,%eax,4), %ecx
        movl    %ecx, 16(%esp,%eax,4)
        addl    24(%esp,%eax,4), %ecx
        movl    %ecx, 16(%esp,%eax,4)
        incl    %eax
        cmpl    $26, %eax
        jne     .LBB1_1
        movl    28(%esp), %eax
        movl    %eax, 4(%esp)
        movl    $.str, (%esp)
        call    printf
        xorl    %eax, %eax
        addl    $128, %esp
        popl    %edi
        ret     $8
This is a less reduced version of the code, which I'd like the D (LDC) compiler to optimize very well:
version(Tango) import tango.stdc.stdio: printf;

struct P {
    int x, y;
    P opAdd(P o) { return P(x+o.x, y+o.y); }
}

struct Rect {
    int lx, ly;
    int opIn_r(P p) {
        return p.x >= 0 && p.x < lx && p.y >= 0 && p.y < ly;
    }
}

void main() {
    const int SIZE = 20;
    auto m = new int[][](SIZE, SIZE);
    auto p = P(10, 10);
    // there's another loop for the rows here
    for (int i; i < SIZE; i++)
        foreach (shift; [P(-2,-1),P(-1,-2),P(1,-2),P(2,-1),P(2,1),P(1,2),P(-1,2),P(-2,1)])
            if (shift + p in Rect(SIZE, SIZE))
                printf("OK\n");
}
Eventually the compiler could produce asm similar to what it generates for the following (well, this uses a uint to avoid testing >= 0; I don't think the LDC compiler will learn this trick soon):
template Tuple(T...) { alias T Tuple; }
alias Tuple!(-1,-2,-2,-1,+1,+2,+2,+1) movex;
alias Tuple!(-2,-1,+1,+2,+2,+1,-1,-2) movey;
foreach (uint i, sx; movex)
    if (x + sx < SIZE && y + movey[i] < SIZE)
        printf("OK\n");
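The uint trick can be shown in isolation. This is a minimal sketch in present-day D2 with hypothetical helper names `inRangeSigned` and `inRangeUnsigned`; it assumes `size` is non-negative. A negative `v` reinterpreted as uint becomes a value of at least 2^31, which is larger than any non-negative int, so one unsigned comparison covers both bounds.

```d
// Two comparisons: the straightforward signed bounds check.
bool inRangeSigned(int v, int size) {
    return v >= 0 && v < size;
}

// One comparison: a negative v wraps to a huge uint and fails the test.
// Assumes size >= 0.
bool inRangeUnsigned(int v, int size) {
    return cast(uint) v < cast(uint) size;
}
```

For any non-negative `size` the two helpers agree on every `int` value of `v`.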


If you need more info please ask.

Bye,
bearophile

This would be better filed in Bugzilla. I don't think it's at the highest priority right now.

A simple implementation idea is to check everything in a loop to see whether some of it could be a compile-time constant, then rewrite the loop when further optimization flags are supplied to the compiler. This could mess up the line-number information used for debugging.


