A little extra info on my previous message.
On Sun, Sep 18, 2022 at 04:44:03PM +0200, Waldek Hebisch wrote:
> I took a look at Poplog (in)efficiency.  My starting point
> was example from HELP * EXTERNAL about thresholding.
<snip>
> Below
> are results on 3GHz Intel i-7 (all times real time in
> microseconds):
>
<snip>
> 1000x100, Pop11 intvec fast_subscrintvec => 1239
<snip> 
> I also tried popc adding version in Syspop11:
> 
> 1000x100, Pop11 fi_< => 1591
> 1000x100, Pop11 fast_subscrintvec => 702
> 1000x100, Pop11 intvec, Syspop => 91

I looked at the difference between the incremental compiler and
popc.  One difference is that popc can directly use the machine
call instruction, while the incremental compiler first loads the
address of the routine into a register and then does an indirect
call.  This seems to be unavoidable, because popc can assume that
relative locations of routines are fixed, while code generated
by the incremental compiler can be moved by the garbage
collector.  But this is probably a small difference.

There is a second difference: in the popc version fi_< is done
via inline code, while the incremental compiler uses a function
call.  This is probably the main source of the difference in
runtime.  Interestingly, the incremental compiler contains an
optimization that should convert fi_< to inline code.  In more
detail, the main loop is:

    fast_for i from 1 to ii do
        if fast_subscrintvec(i, av) fi_< limit then
            0 -> fast_subscrintvec(i, av)
        endif
    endfor;

Using LIB showcode I see that this produces the following
sequence of operations for the Poplog VM:

    PUSHQ   1
    POP     i
    GOTO    label_10
label_8:
    PUSH    i
    PUSH    av
    CALL    fast_subscrintvec
    PUSH    limit
    CALL    fi_<
    IFNOT   label_12
    PUSHQ   0
    PUSH    i
    PUSH    av
    UCALL   fast_subscrintvec
label_12:
label_11:
label_9:
    PUSH    i
    PUSHQ   1
    CALL    fi_+
    POP     i
label_10:
    PUSH    i
    PUSH    ii
    CALL    fi_>
    IFNOT   label_8
label_7:

CALL fi_> + IFNOT (which comes from fast_for) gets converted to
inline code.  CALL fi_< + IFNOT, which could be handled in a
similar way, for some reason leads to an actual function call.
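
To illustrate the kind of rewrite meant here, below is a minimal
sketch of a peephole pass that fuses a comparison call with the
IFNOT that follows it.  It is written in Python, not Pop11, and
is not the actual Poplog code: the (opcode, operand) tuple
format, the function name fuse_compare_branch and the fused
opcode CMP_BRANCH_IFNOT are all made up for illustration.

```python
# Hypothetical instruction format: (opcode, operand) tuples that
# mirror the showcode listing.  CMP_BRANCH_IFNOT is an invented
# fused opcode, not a real Poplog VM instruction.

INLINE_COMPARES = {"fi_<", "fi_>", "fi_<=", "fi_>="}


def fuse_compare_branch(code):
    """Fuse CALL <compare> directly followed by IFNOT into one op."""
    out = []
    i = 0
    while i < len(code):
        op, arg = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if (op == "CALL" and arg in INLINE_COMPARES
                and nxt is not None and nxt[0] == "IFNOT"):
            # Compare and branch become a single inline operation;
            # no procedure call and no boolean result on the stack.
            out.append(("CMP_BRANCH_IFNOT", (arg, nxt[1])))
            i += 2
        else:
            out.append((op, arg))
            i += 1
    return out


# The test inside the loop body from the listing.
loop_test = [
    ("PUSH", "i"), ("PUSH", "av"), ("CALL", "fast_subscrintvec"),
    ("PUSH", "limit"), ("CALL", "fi_<"), ("IFNOT", "label_12"),
]
fused = fuse_compare_branch(loop_test)
```

In this sketch fi_< and fi_> go through exactly the same path,
which is what one would expect from the real optimization too.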

I also tried a Pop11 full vector; using the incremental compiler
this gives:

1000x100, Pop11 full vector => 635

Using popc:

1000x100, Pop11 full vector => 118

With a full vector, array indexing and the fi_< comparison are
optimized to inline code.  As one can see, the popc version
is almost as good as the Syspop one (118 vs 91).  The
incremental compiler is much worse here (635): while it
generates inline code, this code performs reads and stores to
the Poplog user stack, and that may be the main reason for the
slowdown.  There is also another possibility: popc puts the code
checking for traps/stack overflow in a different place.  So it
is possible that the version produced by the incremental
compiler suffers from pathological jump behaviour.

Concerning array indexing, AFAICS the reasons for the
performance difference are clear: Poplog has an optimization for
access to full vectors, but lacks a similar optimization for
access to specialized vectors.  So it would be good to add an
extra inline optimization.

To eliminate access to the Poplog user stack, the incremental
compiler should probably use a method similar to popc's.  In
fact, it would probably be much better to have a single compiler
that operates in two modes (incremental or batch) instead of the
current two compilers.
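
One general way to avoid user-stack traffic is to track pushes
at compile time and only write them to the run-time stack when
something really needs them there.  The sketch below shows that
idea in Python (not Pop11).  It is an assumption for
illustration, not a description of how popc actually works: the
tuple format, the function name lower and the output opcodes
INLINE, MOVE, BRANCH_IFNOT and STACK_PUSH are all made up.

```python
# Hypothetical lowering pass over (opcode, operand) tuples.
# Operands pushed by PUSH/PUSHQ are kept in a compile-time list
# and only spilled to the run-time user stack when needed.

INLINE_ARITY = {"fi_<": 2, "fi_>": 2, "fi_+": 2}


def lower(code):
    pending = []  # operands not yet written to the run-time stack
    out = []
    temps = 0

    def flush():
        # Something needs the real stack: spill pending operands.
        out.extend(("STACK_PUSH", v) for v in pending)
        pending.clear()

    for op, arg in code:
        if op in ("PUSH", "PUSHQ"):
            pending.append(arg)
        elif (op == "CALL" and arg in INLINE_ARITY
              and len(pending) >= INLINE_ARITY[arg]):
            n = INLINE_ARITY[arg]
            operands = tuple(pending[-n:])
            del pending[-n:]
            temp = "t%d" % temps
            temps += 1
            # Operands come straight from variables/constants,
            # the result goes to a temporary (think: register).
            out.append(("INLINE", arg, temp) + operands)
            pending.append(temp)
        elif op == "POP" and pending:
            out.append(("MOVE", pending.pop(), arg))
        elif op == "IFNOT" and pending:
            out.append(("BRANCH_IFNOT", pending.pop(), arg))
        else:
            # Labels, non-inlined calls etc. see the real stack.
            flush()
            out.append((op, arg))
    flush()
    return out


# The loop increment from the listing: i fi_+ 1 -> i
increment = [("PUSH", "i"), ("PUSHQ", 1), ("CALL", "fi_+"), ("POP", "i")]
lowered = lower(increment)
```

For the increment this produces no STACK_PUSH at all.  A
non-inlined call such as CALL fast_subscrintvec forces a flush,
so in this sketch the missing inline optimization for
specialized vectors would also bring the stack traffic back.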

-- 
                              Waldek Hebisch
