Joseph Wakeling:

>My guess for the difference is that you ran less than the full 100 iterations 
>of the main for loop.<

You are right, sorry.


>So I think it's probably just compiler difference that's to blame for speed 
>differences<

That can be true, but such differences are not magic: they can be tracked down, and 
eventually they can even become precise enhancement requests for the LLVM 
developers, who are open to such requests.


>After all, D in general and DMD in particular is in development.<

DMD has quite an old back-end that is not in active development. Maybe it will 
become 64-bit.


>I _am_ surprised that in your profiling the random number generator took up so 
>much time, as if you cut out the lines,
        aa(ratings, reputationUser, reputationObject);
        yzlm(ratings, reputationUser, reputationObject);
... and recompile, the program screams through in no time at all -- which would 
lead me to think that it's the yzlm opCall and the functions it calls that take 
up the majority of time.  Or is there some lazy evaluation going on here ... ?<

I've done a few more tests, and I've seen that my profiling with DMD was wrong: I 
had disabled inlining (compiling without -inline, to get a less aggregated -profile 
result), but that produced misleading numbers, because in the real optimized run the 
inlining removes most of the overhead of the Kiss generator. So the true profile is 
as you say:

    Func Time
17703021  Yzlm.objectReputationUpdate
12709135  Yzlm.userReputationUpdate
 1385194  main


In this program I have seen that a good percentage of the speed difference 
between LDC and DMD is caused by foreach loops. When you iterate over structs 
larger than a size_t, use ref (or, better in D2, ref const(...)):
foreach (Rating r; ratings)
==>
foreach (ref const(Rating) r; ratings)
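
To make that concrete, a minimal sketch (D2 syntax; the field types of Rating here 
are just my guess about the original struct, only the field names come from the code 
below):

struct Rating {
    uint user, object;
    double value;  // assumed layout: bigger than a size_t, so copies cost something
}

double sumSquaredError(Rating[] ratings, double[] reputationObject) {
    double total = 0.0;
    // ref const(...) avoids copying the whole struct at every iteration
    foreach (ref const(Rating) r; ratings) {
        auto aux = r.value - reputationObject[r.object];
        total += aux * aux;
    }
    return total;
}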

There is nothing algorithmically strange going on here, but to be sure you can 
take a look at the produced asm.


This is the middle part (the two loops) of Yzlm.objectReputationUpdate 
compiled with LDC; it is one of the two main hot spots of the program. You can 
compare this asm with the asm of the same part of the C++ version:

...
.LBB6_2:
    movl    (%eax), %edx
    movl    52(%esp), %edi
    movl    4(%edi), %ebx
    movsd   (%ebx,%edx,8), %xmm0
    mulsd   8(%eax), %xmm0
    movl    4(%eax), %edx
    movl    48(%esp), %ebx
    movl    4(%ebx), %ebx
    addsd   (%ebx,%edx,8), %xmm0
    movsd   %xmm0, (%ebx,%edx,8)
    movl    4(%eax), %edx
    movl    36(%esi), %ebx
    movsd   (%ebx,%edx,8), %xmm0
    movl    (%eax), %ebp
    movl    4(%edi), %edi
    addsd   (%edi,%ebp,8), %xmm0
    movsd   %xmm0, (%ebx,%edx,8)
    addl    $16, %eax
    decl    %ecx
    jne .LBB6_2
.LBB6_3:
    movl    48(%esp), %eax
    movl    (%eax), %eax
    testl   %eax, %eax
    je  .LBB6_9
    movl    48(%esp), %ecx
    movl    4(%ecx), %ecx
    pxor    %xmm0, %xmm0
    xorl    %edx, %edx
    pxor    %xmm1, %xmm1
    .align  16
.LBB6_5:
    movl    36(%esi), %edi
    movsd   (%edi,%edx,8), %xmm2
    ucomisd %xmm1, %xmm2
    ja  .LBB6_7
    movsd   .LCPI6_0, %xmm2
.LBB6_7:
    movsd   (%ecx,%edx,8), %xmm3
    divsd   %xmm2, %xmm3
    movsd   %xmm3, (%ecx,%edx,8)
    movl    44(%esi), %edi
    subsd   (%edi,%edx,8), %xmm3
    mulsd   %xmm3, %xmm3
    addsd   %xmm3, %xmm0
    incl    %edx
    cmpl    %eax, %edx
    jne .LBB6_5
...



I've also seen that the program gets a little faster with some unrolling, done 
like this:

        const int UNROLL = 5;
        assert(!(ratings.length % UNROLL));
        assert(!(reputationUser.length % UNROLL));

        for (int i; i < ratings.length; i += UNROLL) {
            // foreach over a compile-time tuple is unrolled by the compiler
            foreach (k; Range!(UNROLL)) {
                auto r = &(ratings[i + k]);
                auto aux = (r.value - reputationObject[r.object]);
                reputationUser[r.user] += aux * aux;
            }
        }


Where:

// Tuple!(A...) is just an alias for the sequence of its arguments
template Tuple(T...) {
    alias T Tuple;
}

// Range!(N) yields the compile-time tuple (0, 1, ..., N-1)
template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}
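
For instance Range!(3) expands to the compile-time tuple (0, 1, 2); a quick 
compile-time check (just illustrative):

static assert(Range!(3)[0] == 0 && Range!(3)[1] == 1 && Range!(3)[2] == 2);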


But to make it work with arrays of arbitrary length you need a little more 
code, using a strategy similar to:

sum = 0;
for (i = 0; i < n - 3; i += 4)
  sum += A[i] + A[i+1] + A[i+2] + A[i+3];
for (; i < n; i++)
  sum += A[i];
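
For example, here is a sketch of that strategy combined with the unrolled loop above 
(the function name and signature are only illustrative; it assumes the same Rating 
and Range!() definitions as above):

void userReputationStep(Rating[] ratings, double[] reputationUser,
                        double[] reputationObject) {
    const int UNROLL = 5;
    size_t i = 0;

    // unrolled main loop: processes UNROLL ratings at a time
    for (; i + UNROLL <= ratings.length; i += UNROLL) {
        foreach (k; Range!(UNROLL)) {
            auto r = &(ratings[i + k]);
            auto aux = r.value - reputationObject[r.object];
            reputationUser[r.user] += aux * aux;
        }
    }

    // scalar remainder loop for the last ratings.length % UNROLL items
    for (; i < ratings.length; i++) {
        auto r = &(ratings[i]);
        auto aux = r.value - reputationObject[r.object];
        reputationUser[r.user] += aux * aux;
    }
}

This way the two asserts on the array lengths are not needed any more.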
  

>I don't think it's the algorithm in any case (although it might be Phobos' 
>implementation) as working with Tango and LDC, changing between the Twister or 
>Kiss random number generators doesn't make for any meaningful performance 
>difference.<

The Phobos and Tango implementations are probably different when you use 
uniform(): the Phobos one contains an enforce() and a nextafter() call that are 
probably not present in the Tango code.
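
Just to make that concrete, a rough sketch of the kind of extra per-call work (this 
is not the actual Phobos source, only an illustration; uniformSketch is a made-up 
name):

import std.math: nextafter;

double uniformSketch(double a, double b, uint rawRandom) {
    assert(a < b);                            // stands in for the enforce(a < b) check
    double hi = nextafter(b, double.max);     // the per-call nextafter() adjustment
    return a + (hi - a) * (rawRandom / (uint.max + 1.0));
}

A uniform() that skips the check and the nextafter() call does a bit less work on 
every call.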


>I noticed that when compiled with LDC the memory usage stays constant, but in 
>DMD the memory use blows up.  Is this a D1/D2 difference (with D2 having the 
>more advanced features that require preservation of memory) or is it a bug in 
>one or the other of the compilers ... ?<

D2 dynamic arrays have been changed recently, while D1 dynamic arrays have not 
changed, and LDC is still D1-only. So this is just a D1/D2 difference. The purpose 
of this change between D1 and D2 was to avoid bugs caused by "memory stomping" 
through array slices.
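
A minimal example of the kind of bug that change is meant to avoid (just a sketch):

void main() {
    int[] a = [1, 2, 3, 4];
    int[] b = a[0 .. 2];  // a slice that shares memory with a
    b ~= 99;              // in D1 this can be written into a's block, clobbering a[2];
                          // the D2 runtime detects the situation and reallocates b instead
    assert(a[2] == 3);    // holds with the D2 behaviour
}

The D2 behaviour can require more reallocations, and that fits the higher memory use 
you have seen with DMD.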

Bye,
bearophile
