"those instructions should all take one execution cycle each" is likely where your problem lies. Who says so? The STM8 is pipelined, and the given "cycle" counts assume that the first decode cycle of an instruction overlaps with the execution cycle of the previous instruction. This is only _mostly_ the case. Sometimes you get decode stalls, for various reasons, and an extra cycle is taken. There are two or three stalls in your code that stand out, but I'd guess the one that matters is the read-after-write register stall implied by add/inc a followed by dec (0x.., sp). Couple that with a difference in how often the add/inc is skipped over by the preceding jr, and that would account for your difference.

Sadly, the pipetrace functionality was removed from uCsim (why?), so if you want something more accurate than the no-stalls-and-everything-overlaps counts, someone would have to work through it by hand.
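
To see how often the add/inc is skipped in each routine, the two loops can be modelled in a few lines of Python. This is only a rough sketch: the function names are made up for illustration, the input value 0x0F is hypothetical (the thread does not say what value is actually rotated), and it counts branch outcomes only, not cycles or stalls. The rotate count of 6 comes from the "push #0x06" in the benchmark wrapper.

```python
def rotate_left_8(value, count):
    """Model of _rotate_left_8. Returns (result, times 'inc a' was skipped)."""
    a = value & 0xFF
    n = count & 0x07              # and a, #0x07
    skipped = 0
    while n:                      # tnz/jreq guard, then dec/jrne loop
        carry = a >> 7            # sll a: bit 7 moves into carry
        a = (a << 1) & 0xFF
        if carry:
            a += 1                # inc a (bit 0 is clear after the shift)
        else:
            skipped += 1          # jrnc taken, inc a skipped
        n -= 1
    return a, skipped


def rotate_right_8(value, count):
    """Model of _rotate_right_8. Returns (result, times 'add a, #0x80' was skipped)."""
    a = value & 0xFF
    n = count & 0x07
    skipped = 0
    while n:
        carry = a & 1             # srl a: bit 0 moves into carry
        a >>= 1
        if carry:
            a = (a + 0x80) & 0xFF # add a, #0x80 (bit 7 is clear after the shift)
        else:
            skipped += 1          # jrnc taken, add skipped
        n -= 1
    return a, skipped


if __name__ == "__main__":
    # 0x0F is a hypothetical test value, not taken from the benchmark.
    print(rotate_left_8(0x0F, 6))
    print(rotate_right_8(0x0F, 6))
```

With the hypothetical value 0x0F and a count of 6, the left rotate skips the inc four times and the right rotate skips the add twice, so the two routines do not take the same path through the loop even though their instruction mix looks the same. Whether that maps onto the measured difference depends on the actual input value and on where the stalls fall.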

Mike


On 18/12/2022 03:16, Basil Hussain wrote:
I have a setup where I am using the timer facility in uCsim to benchmark/profile the number of execution cycles of pieces of code. To explain the setup briefly, I create a timer as well as a breakpoint on writes to a GPIO port address, then with a breakpoint script, every time it breaks I stop the timer, get its value, reset it to zero, restart it, then continue sim execution. This allows me to bracket sections of my code to be benchmarked just by toggling the relevant GPIO port.

However, when recently looking at the results for two pieces of code that should be identical in terms of number of execution cycles of the assembly, I am actually seeing a discrepancy in the counted cycles.

Code 'A':
timer #0("benchmark") OFF 0.044375687499974 sec (710011 clks)

Code 'B':
timer #0("benchmark") OFF 0.045625625000010 sec (730010 clks)
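
If the timer lines are captured to a log, they can be pulled apart with a small script. The following is a sketch only: the regular expression is inferred from the two lines shown above, so other uCsim versions may print something different, and the function name parse_timer_line is made up for illustration.

```python
import re

# Pattern inferred from the two timer lines quoted in this message (assumption).
TIMER_LINE = re.compile(
    r'timer #(?P<id>\d+)\("(?P<name>[^"]*)"\)\s+(?P<state>\w+)\s+'
    r'(?P<seconds>[0-9.]+) sec \((?P<clks>\d+) clks\)'
)


def parse_timer_line(line):
    """Return the fields of one timer line as a dict, or None if it does not match."""
    m = TIMER_LINE.search(line)
    if m is None:
        return None
    return {
        "id": int(m.group("id")),
        "name": m.group("name"),
        "state": m.group("state"),
        "seconds": float(m.group("seconds")),
        "clks": int(m.group("clks")),
    }


if __name__ == "__main__":
    sample = 'timer #0("benchmark") OFF 0.044375687499974 sec (710011 clks)'
    print(parse_timer_line(sample))
```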

I am at a loss to explain why there is a 20k cycle difference there. One thing that does correlate is that 20k is a multiple of the number of iterations of my benchmark testing, which is 10k. So something is counting an extra 2 cycles per iteration within the code.
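
For reference, the arithmetic behind that can be checked with a few lines of Python. This is just a sketch using the numbers printed above; the function names are made up, and the clock frequency is derived from the timer output rather than taken from any configuration.

```python
CLKS_A = 710011          # code 'A', from the timer output
CLKS_B = 730010          # code 'B', from the timer output
SECONDS_A = 0.044375687499974
ITERATIONS = 10_000      # 0x2710, the loop count in the benchmark wrapper


def extra_cycles_per_iteration(clks_a, clks_b, iterations):
    """Average number of additional cycles per iteration in B compared with A."""
    return (clks_b - clks_a) / iterations


def implied_clock_hz(clks, seconds):
    """Clock frequency implied by a clock count and its reported duration."""
    return clks / seconds


if __name__ == "__main__":
    print(CLKS_B - CLKS_A)
    print(extra_cycles_per_iteration(CLKS_A, CLKS_B, ITERATIONS))
    print(implied_clock_hz(CLKS_A, SECONDS_A))
```

The raw difference is 19999 clocks rather than exactly 20000, which works out to very nearly 2 cycles per iteration, and the clock count and duration together imply a simulated clock of about 16 MHz.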

Here are the pertinent pieces of code in question that I am trying to benchmark:

Code 'A':
_rotate_left_8:
    ld    a, (4 +1, sp)
    and    a, #0x07
    ld    (4 +1, sp), a
    ld    a, (4 +0, sp)
    tnz    (4 +1, sp)
    jreq    0003$
0001$:
    sll    a
    jrnc    0002$
    inc    a
0002$:
    dec    (4 +1, sp)
    jrne    0001$
0003$:
    retf

Code 'B':
_rotate_right_8:
    ld    a, (4 +1, sp)
    and    a, #0x07
    ld    (4 +1, sp), a
    ld    a, (4 +0, sp)
    tnz    (4 +1, sp)
    jreq    0003$
0001$:
    srl    a
    jrnc    0002$
    add    a, #0x80
0002$:
    dec    (4 +1, sp)
    jrne    0001$
0003$:
    retf

As you can see, the only differences are one "sll" vs "srl" instruction, and one "inc" vs "add" - the rest is identical. And those instructions should all take one execution cycle each. So there should be no difference in the total number of execution cycles between the two pieces of code.

I did think perhaps there may be a difference in benchmarking wrapper code that runs the code above for a specified number of iterations. This code is in C, so is at the mercy of SDCC's compilation for consistency of execution, so maybe differences exist. But, having checked that, I see no significant differences.

Benchmark wrapper assembly for code 'A':
    bset    0x500a, #5
    ldw    x, #0x2710
00122$:
    ldw    y, x
    decw    x
    tnzw    y
    jreq    00125$
    pushw    x
    push    #0x06
    push    _benchmark_rotate_val_8_65536_344+0
    callf    _rotate_left_8
    addw    sp, #2
    popw    x
    jra    00122$
00125$:
    bres    0x500a, #5

Benchmark wrapper assembly for code 'B':
    bset    0x500a, #5
    ldw    x, #0x2710
00212$:
    ldw    y, x
    decw    x
    tnzw    y
    jreq    00215$
    pushw    x
    push    #0x06
    push    _benchmark_rotate_val_8_65536_344+0
    callf    _rotate_right_8
    addw    sp, #2
    popw    x
    jra    00212$
00215$:
    bres    0x500a, #5

You can see they are identical apart from the labels.

So, where is uCsim getting the differences in measured cycles from? I seem to recall that although cycle counts for some STM8 instructions were incorrect in older SDCC releases, they had been corrected quite a while ago - are some still incorrect? This is with uCsim 0.6.4 from SDCC 4.2.0.

Regards,
Basil Hussain


_______________________________________________
Sdcc-user mailing list
Sdcc-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sdcc-user

