On 4/28/25 06:26, BALATON Zoltan wrote:
I have tried profiling the case where dst is in real-card VFIO VRAM with dcbz (with 100 iterations instead of the 10000 used in the tests above), but I'm not sure I understand the results. vperm and dcbz show up, but not very high. Can somebody explain what is happening here and where the overhead likely comes from? Here is the profile result I got:

Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
   Children      Self  Command          Shared Object            Symbol
-   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
    - 98.49% cpu_exec_loop
       - 98.48% cpu_tb_exec
          - 90.95% 0x7f4e705d8f15
               helper_ldub_mmu
               do_ld_mmio_beN
             - cpu_io_recompile
                - 45.79% cpu_loop_exit_noexc

I think the real problem is the number of loop exits due to I/O. If I'm reading this correctly, about 45% of execution time is spent under cpu_io_recompile.

I/O can only happen as the last insn of a translation block. When we detect that it has happened in the middle of a translation block, we abort the block, compile a new one, and restart execution.
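
For reference, this abort-and-retry path is cpu_io_recompile() in accel/tcg. A condensed sketch of the flow, assuming recent QEMU sources and internal headers; function and flag names change between versions, and error handling is omitted:

    /* Sketch of cpu_io_recompile(): called when a load/store inside
     * a TB turns out to touch MMIO.  Not verbatim QEMU code. */
    void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
    {
        /* Find the TB containing the host return address of the
         * offending memory access. */
        TranslationBlock *tb = tcg_tb_lookup(retaddr);

        /* Roll guest state back to the start of that insn. */
        cpu_restore_state_from_tb(cpu, tb, retaddr);

        /* Request that the next TB contain just this one insn, so
         * the I/O access is also the last insn of the block.  The
         * exact cflags bits (e.g. CF_MEMI_ONLY) vary by version. */
        cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | 1;

        /* Longjmp back to the exec loop; does not return.  The
         * access is re-executed from the one-insn TB.  The original
         * full-size TB stays valid, so the next loop iteration
         * enters it again and takes the same abort. */
        cpu_loop_exit_noexc(cpu);
    }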

This becomes a bottleneck when the same translation block sits in a loop, which is exactly this case of memset/memcpy of VRAM: every iteration aborts the block and takes the slow exit again. It could be addressed by invalidating the previous translation block and creating a new one which always ends with the I/O; a hypothetical sketch of that follows.
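
A hypothetical sketch of that idea (tb_phys_invalidate() and the functions above are real QEMU internals, but wiring them up this way is invented here for illustration, and the mechanism for telling the translator where to stop is left out):

    /* Hypothetical: like cpu_io_recompile(), but also drop the old
     * TB so its replacement, which ends at the I/O insn, becomes
     * the cached block that a loop re-enters on later iterations. */
    static void cpu_io_recompile_persistent(CPUState *cpu, uintptr_t retaddr)
    {
        TranslationBlock *tb = tcg_tb_lookup(retaddr);

        cpu_restore_state_from_tb(cpu, tb, retaddr);

        /* Invalidate the block that had the I/O in its middle; the
         * next lookup for this guest pc misses and retranslates. */
        tb_phys_invalidate(tb, -1);

        /* The retranslation would need a hint (not shown) to end
         * the new block at the faulting insn.  After that, the loop
         * runs [new TB ending in I/O] -> [TB for the rest] with no
         * further mid-block aborts. */
        cpu_loop_exit_noexc(cpu);
    }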


r~
