On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Mon, 28 Apr 2025, Richard Henderson wrote:
On 4/28/25 06:26, BALATON Zoltan wrote:
I have tried profiling the dcbz case with dst in real-card VFIO VRAM
(with 100 iterations instead of the 10000 in the above tests), but I'm
not sure I understand the results. vperm and dcbz show up, but not
very high. Can somebody explain what is happening here and where the
overhead likely comes from? Here is the profile result I got:
Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
   Children      Self  Command          Shared Object            Symbol
-   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
    - 98.49% cpu_exec_loop
       - 98.48% cpu_tb_exec
          - 90.95% 0x7f4e705d8f15
               helper_ldub_mmu
               do_ld_mmio_beN
             - cpu_io_recompile
                - 45.79% cpu_loop_exit_noexc

I think the real problem is the number of loop exits due to i/o.  If
I'm reading this rightly, 45% of execution is in cpu_io_recompile.

I/O can only happen as the last insn of a translation block.
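
For reference, the guard that raises this looks roughly like the
following (a paraphrase of the !can_do_io check in accel/tcg/cputlb.c;
exact placement varies between versions):

    /* If this is not the last insn of the TB we cannot do I/O yet:
     * abandon the TB and retranslate so that the access becomes the
     * final insn. cpu_io_recompile() does not return. */
    if (!cpu->neg.can_do_io) {
        cpu_io_recompile(cpu, retaddr);
    }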

I'm not sure I understand this. A comment above cpu_io_recompile says
"In deterministic execution mode, instructions doing device I/Os must
be at the end of the TB." Is that wrong? If not, shouldn't this only
apply when running with icount or something like that?

That comment should be fixed. It used to be the case only for icount
mode, but there was another race bug which means we need to honour
device access as the last insn in both modes.
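
Something like this would be more accurate (just a suggested wording):

    /*
     * Instructions doing device I/O must be the last insn of a TB.
     * This used to be required only for deterministic (icount)
     * execution, but is now required in all modes to avoid a race.
     */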


When we detect that it has happened in the middle of a translation
block, we abort the block, compile a new one, and restart execution.

Where does that happen? The calls to cpu_io_recompile in this case
seem to come from io_prepare, which is called from do_ld16_mmio_beN if
(!cpu->neg.can_do_io), but I don't see where can_do_io is set.

It is set inline, by set_can_do_io().
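
That is, the translator emits a store to cpu->neg.can_do_io directly
into the generated code rather than calling a helper. From memory it
is roughly this (see accel/tcg/translator.c for the real thing):

    /* Flip cpu->neg.can_do_io from generated code; the translator
     * sets it to true only when emitting the last insn of the TB. */
    tcg_gen_st8_i32(tcg_constant_i32(val), tcg_env,
                    offsetof(ArchCPU, parent_obj.neg.can_do_io) -
                    offsetof(ArchCPU, env));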

That one I've found, but I don't know where the cpu_loop_exit at the
end of cpu_io_recompile returns to.

Where this becomes a bottleneck is when this same translation block
is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
could be addressed by invalidating the previous translation block
and creating a new one which always ends with the i/o.
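
Something along these lines at the end of cpu_io_recompile(),
completely untested:

    /* Drop the TB that tried to do I/O mid-block so that later loop
     * iterations do not keep rediscovering it; the retranslated
     * block then ends at the I/O insn and can be reused. */
    TranslationBlock *tb = tcg_tb_lookup(retaddr);
    if (tb) {
        tb_phys_invalidate(tb, -1);
    }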

And where should that be done? cpu_io_recompile just exits the TB,
but what generates the new TB? I need some more clues to understand
how to do this.

 cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;

sets the cflags for the next TB; the subsequent lookup will typically
fail to find a matching TB and so regenerate one. Normally
cflags_next_tb is empty.
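
The consumer is at the top of the execution loop (paraphrasing
accel/tcg/cpu-exec.c):

    /* One-shot override: consume cflags_next_tb for the next TB
     * lookup, then reset it so later lookups use the normal cflags. */
    cflags = cpu->cflags_next_tb;
    if (cflags == -1) {
        cflags = curr_cflags(cpu);
    } else {
        cpu->cflags_next_tb = -1;
    }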

Shouldn't this only regenerate the next TB on the first loop iteration and not afterwards?

Regards,
BALATON Zoltan
