On 4/28/25 06:26, BALATON Zoltan wrote:
I have tried profiling the case where dst is in real-card VFIO VRAM with dcbz (with 100 iterations instead of the 10000 used in the tests above), but I'm not sure I understand the results. vperm and dcbz show up, but not very high. Can somebody explain what is happening here and where the overhead likely comes from? Here is the profile result I got:

Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
   Children      Self  Command          Shared Object            Symbol
-   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
    - 98.49% cpu_exec_loop
       - 98.48% cpu_tb_exec
          - 90.95% 0x7f4e705d8f15
               helper_ldub_mmu
               do_ld_mmio_beN
             - cpu_io_recompile
                - 45.79% cpu_loop_exit_noexc

I think the real problem is the number of loop exits due to I/O. If I'm reading this correctly, about 45% of execution time is spent under cpu_io_recompile.

I/O can only happen as the last insn of a translation block. When we detect that it has happened in the middle of a translation block, we abort the block, compile a new one, and restart execution.
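
For reference, this abort-and-retry path is cpu_io_recompile() in accel/tcg. A condensed sketch of the flow, assuming recent QEMU sources and internal headers; function and flag names change between versions, and error handling is omitted:

    /* Sketch of cpu_io_recompile(): called when a load/store inside
     * a TB turns out to touch MMIO.  Not verbatim QEMU code. */
    void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
    {
        /* Find the TB containing the host return address of the
         * offending memory access. */
        TranslationBlock *tb = tcg_tb_lookup(retaddr);

        /* Roll guest state back to the start of that insn. */
        cpu_restore_state_from_tb(cpu, tb, retaddr);

        /* Request that the next TB contain just this one insn, so
         * the I/O access is also the last insn of the block.  The
         * exact cflags bits (e.g. CF_MEMI_ONLY) vary by version. */
        cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | 1;

        /* Longjmp back to the exec loop; does not return.  The
         * access is re-executed from the one-insn TB.  The original
         * full-size TB stays valid, so the next loop iteration
         * enters it again and takes the same abort. */
        cpu_loop_exit_noexc(cpu);
    }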

This becomes a bottleneck when the same translation block sits in a loop, which is exactly this case of memset/memcpy of VRAM: every iteration aborts the block and takes the slow exit again. It could be addressed by invalidating the previous translation block and creating a new one which always ends with the I/O; a hypothetical sketch of that follows.
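
A hypothetical sketch of that idea (tb_phys_invalidate() and the functions above are real QEMU internals, but wiring them up this way is invented here for illustration, and the mechanism for telling the translator where to stop is left out):

    /* Hypothetical: like cpu_io_recompile(), but also drop the old
     * TB so its replacement, which ends at the I/O insn, becomes
     * the cached block that a loop re-enters on later iterations. */
    static void cpu_io_recompile_persistent(CPUState *cpu, uintptr_t retaddr)
    {
        TranslationBlock *tb = tcg_tb_lookup(retaddr);

        cpu_restore_state_from_tb(cpu, tb, retaddr);

        /* Invalidate the block that had the I/O in its middle; the
         * next lookup for this guest pc misses and retranslates. */
        tb_phys_invalidate(tb, -1);

        /* The retranslation would need a hint (not shown) to end
         * the new block at the faulting insn.  After that, the loop
         * runs [new TB ending in I/O] -> [TB for the rest] with no
         * further mid-block aborts. */
        cpu_loop_exit_noexc(cpu);
    }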


r~
