Andrei:

>I optimized things such that the commonly used path (many calls to empty, 
>front, and popFront) is as fast as possible. The initial work will be 
>amortized for most loops.<

New timings:

Running time, best of 3, seconds:
  test1:     0.31
  test1 opt: 0.07
  test2:     0.31
  test2 opt: 0.12
  test3:     6.38
  test3 opt: 0.52
  test3:     3.78  (new iota)
  test3 opt: 0.29  (new iota)
  test4:     4.70
  test4 opt: 1.25
  test4:     2.03 (new iota)
  test4 opt: 0.27 (new iota)

Compile times opt version, seconds:
  test1: 0.05
  test2: 0.05
  test3: 0.28
  test4: 0.23 (new iota)
  test4: 0.29
  test4: 0.24 (new iota)

(The difference in compile time between 0.29 and 0.23 is probably not 
significant. The difference between 0.05 and 0.23 is significant.)

In my opinion the non-opt runtime timings too are important, because you can't 
compile code every time in opt mode. The normal for/foreach gives some 
performance even with no optmized compilation.

It's good to optimize retro() so it can recognize the iota() argument and 
return just another iota().


>On my machine, test4 is still 2x slower than foreach_reverse in an optimized 
>build.<

As you see on mine in an optimized build it's a bit slower than 2x than 
foreach_reverse and more than 4x slower than a normal for loop.


>The disassembly reveals that the compiler refuses to inline some calls, which 
>is disappointing as their bodies are very simple.<

The code inside the loop here is very simple, it's just a variable (register) 
inc. In this case, in an optimized build GCC is able to optimize away the whole 
for loop. In my opinion a good D2 compiler has to to do the same with a 
foreach(retro(iota)) :-)

If I am not wrong this is the inner loop of test4 with the new iota in opt 
build:

L3F:            mov     ECX,018h[ESP]
                inc     EBX
                add     010h[ESP],ECX
                mov     EDX,010h[ESP]
                cmp     EDX,014h[ESP]
                jne     L3F


It's not very good, but it's better than before. All those memory loads and 
stores are not necessary in this loop, they waste time.

There is a call to this function, but it's before the call:

_D3std5range13__T4IotaTiTiZ4Iota6__ctorMFNciiiZS3std5range13__T4IotaTiTiZ4Iota  
comdat
L0:             sub     ESP,0Ch
                mov     ECX,offset 
FLAT:_D3std5range13__T4IotaTiTiZ4Iota6__ctorMFNciiiZS3std5range13__T4IotaTiTiZ4Iota14__dgliteral607MFZAxa
                push    EBX
                push    EBP
                mov     EBP,020h[ESP]
                push    ESI
                mov     ESI,EAX
                push    EDI
                mov     EDI,020h[ESP]
                test    EDI,EDI
                mov     014h[ESP],EAX
                mov     018h[ESP],ECX
                je      L28
                cmp     EBP,024h[ESP]
                jne     L2C
L28:            xor     EDX,EDX
                jmp short       L31
L2C:            mov     EDX,1
L31:            xor     DL,1
                je      L5A
                push    dword ptr FLAT:_DATA[054h]
                mov     EDX,ECX
                push    dword ptr FLAT:_DATA[050h]
                push    084Dh
                mov     EAX,020h[ESP]
                mov     EBX,020h[ESP]
                call    EDX
                push    EDX
                push    EAX
                call    near ptr _D3std9contracts7bailOutFAyaixAaZv
L5A:            test    EDI,EDI
                mov     [ESI],EBP
                mov     8[ESI],EDI
                jle     L77
                mov     EDX,024h[ESP]
                dec     EDX
                mov     EAX,EDX
                mov     4[ESI],EDX
                sub     EAX,EBP
                cdq
                idiv    EDI
                sub     4[ESI],EDX
                jmp short       L89
L77:            mov     ECX,024h[ESP]
                inc     ECX
                mov     EAX,ECX
                mov     4[ESI],ECX
                sub     EAX,EBP
                cdq
                idiv    EDI
                add     4[ESI],EDX
L89:            add     4[ESI],EDI
                mov     EAX,ESI
                pop     EDI
                pop     ESI
                pop     EBP
                pop     EBX
                add     ESP,0Ch
                ret     0Ch


It's a lot of code, that's the iota constructor. The 
_D3std9contracts7bailOutFAyaixAaZv is probably the enforce or part of it.

Doubling the loop size (N), the running time becomes about double, so such call 
to the constructor is not taking a lot of time (because there are no loops in 
the constructor).

-------------------

I have seen that iota() contains:

this(N current, N pastLast, S step) {
    enforce(step != 0 && current != pastLast);


Indeed this throws:

import std.range;
void main() {
    int count;
    foreach (x; iota(10, 10, 1))
        count++;
}


While his Python code gives no output:

>>> for x in xrange(10, 10, 1): print x
...
>>>

>>> for x in xrange(20, 10, 1): print x
...

In my opinion an empty loop is a valid loop, just as this is valid both 
syntactically and logically:

for (int i = 10; i < 10; i++) {}
for (int i = 20; i < 10; i++) {}

So I suggest to replace that line in the iota() costructor with:

enforce(step != 0);

And then create an empty generator if pastLast <= current (and the step is 1).

Bye,
bearophile

Reply via email to