On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
> On Tue, 29 Apr 2025, Alex Bennée wrote:
>> BALATON Zoltan <bala...@eik.bme.hu> writes:
>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>> BALATON Zoltan <bala...@eik.bme.hu> writes:
>>>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>>>>>> I have tried profiling the dst in real card vfio vram with dcbz
>>>>>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>>>>>> not sure I understand the results. vperm and dcbz show up but not
>>>>>>> too high. Can somebody explain what is happening here and where the
>>>>>>> overhead likely comes from? Here is the profile result I got:
>>>>>>> Samples: 104K of event 'cycles:Pu', Event count (approx.):
>>>>>>> 122371086557
>>>>>>>    Children      Self  Command          Shared Object            Symbol
>>>>>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.]
>>>>>>> cpu_exec_loop
>>>>>>>     - 98.49% cpu_exec_loop
>>>>>>>        - 98.48% cpu_tb_exec
>>>>>>>           - 90.95% 0x7f4e705d8f15
>>>>>>>                helper_ldub_mmu
>>>>>>>                do_ld_mmio_beN
>>>>>>>              - cpu_io_recompile
>>>>>>>                 - 45.79% cpu_loop_exit_noexc
>>>>>>
>>>>>> I think the real problem is the number of loop exits due to i/o.  If
>>>>>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>>>>>
>>>>>> I/O can only happen as the last insn of a translation block.
>>>>>
>>>>> I'm not sure I understand this. A comment above cpu_io_recompile says
>>>>> "In deterministic execution mode, instructions doing device I/Os must
>>>>> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
>>>>> apply if running with icount or something like that?
>>>>
>>>> That comment should be fixed. It used to only be the case for icount
>>>> mode but there was another race bug that meant we need to honour device
>>>> access as the last insn for both modes.
>>>>
>>>>>
>>>>>> When we detect that it has happened in the middle of a translation
>>>>>> block, we abort the block, compile a new one, and restart execution.
>>>>>
>>>>> Where does that happen? The calls of cpu_io_recompile in this case
>>>>> seem to come from io_prepare which is called from do_ld16_mmio_beN if
>>>>> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.
>>>>
>>>> Inline by set_can_do_io()
>>>
>>> That one I've found but don't know where the cpu_loop_exit returns
>>> from the end of cpu_io_recompile.
>>
>> cpu_loop_exit longjmp's back to the top of the execution loop.
>>
>>>
>>>>>> Where this becomes a bottleneck is when this same translation block
>>>>>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>>>>>> could be addressed by invalidating the previous translation block
>>>>>> and creating a new one which always ends with the i/o.
>>>>>
>>>>> And where to do that? cpu_io_recompile just exits the TB but what
>>>>> generates the new TB? I need some more clues to understands how to do
>>>>> this.
>>>>
>>>>  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
>>>>
>>>> sets the cflags for the next TB, which the lookup will typically fail to
>>>> find and then regenerate. Normally cflags_next_tb is empty.
>>>
>>> Shouldn't this only regenerate the next TB on the first loop iteration
>>> and not afterwards?
>>
>> if we've been here before (needing n insn from the base addr) we will
>> have a cached translation we can re-use. It doesn't stop the longer TB
>> being called again as we re-enter a loop.
>
> So then maybe it should at least check if there's already a cached TB 
> where it can continue before calling cpu_io_recompile in io_prepare and 
> only recompile if needed?

It basically does do that, AFAICS. The name cpu_io_recompile() is misleading:
it does not cause a recompile itself, it just updates cflags and exits. The
next entry into the execution loop will look up the TB that has just 1 insn
and enter that.
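
Stripped down, the path is roughly this (a simplified sketch only, with
the unwind, icount and delay-slot details elided -- not the exact code
in accel/tcg):

    /*
     * Sketch of cpu_io_recompile(): called from io_prepare() when a
     * memory access hits MMIO mid-TB and !cpu->neg.can_do_io.
     */
    void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
    {
        int n = 1;  /* re-execute just the faulting insn (delay slots aside) */

        /*
         * Request that the next TB contain only n insns, so the I/O
         * insn is last and the access can really be performed then.
         */
        cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;

        /* longjmp back to the top of cpu_exec_loop; nothing is compiled here */
        cpu_loop_exit_noexc(cpu);
    }

Since cflags are part of the TB lookup key, after the first iteration
that 1-insn TB is found in the cache and re-used; what remains is the
cost of bouncing between the long TB and the short one on every pass
of the guest loop.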

> I was thinking maybe we need a flag or counter to see if cpu_io_recompile 
> is called more than once and, after a limit, invalidate the TB and create 
> two new ones: the first ending at the I/O and then what cpu_io_recompile 
> does now. As I understood it that was what Richard suggested, but I don't 
> know how to do that.

memset/memcpy routines had much the same problem on real hardware.
They wanted to use vector instructions for best performance, but when
those were used on MMIO they would trap and be very slow.

The problem is that we don't know ahead of time whether a given routine
will access MMIO or not. You could recompile it with fewer instructions,
but then it will be slow when used on regular memory.

Heuristics are tough because you could have, e.g., one initial big
memset that clears an MMIO region by iterating many times over an inner
loop of dcbz instructions, but is then never used for MMIO again while
remaining important for regular page clearing. Something that
dynamically decays, or periodically recompiles back to the non-I/O
case, could perhaps cope with that, but then the complexity goes up.
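
Just to make the idea concrete, a per-TB counter could look something
like the sketch below. This is purely hypothetical: the mmio_exit_count
field, the threshold and the helper do not exist in QEMU, they are only
meant to show the shape of such a heuristic.

    /*
     * Hypothetical only: count how often a TB is aborted for MMIO and,
     * past a limit, invalidate it so the next lookup regenerates a TB
     * that ends at the I/O insn instead of exiting on every pass.
     */
    #define MMIO_SPLIT_THRESHOLD 8          /* made-up number */

    static void maybe_split_tb_for_mmio(CPUState *cpu, TranslationBlock *tb,
                                        int insns_before_io)
    {
        if (++tb->mmio_exit_count < MMIO_SPLIT_THRESHOLD) { /* made-up field */
            return;                         /* keep the current slow path */
        }
        tb_phys_invalidate(tb, -1);         /* drop the long TB */
        /* next lookup generates a TB of insns_before_io insns, I/O last */
        cpu->cflags_next_tb = curr_cflags(cpu) | insns_before_io;
    }

The decay would be the other half: e.g. zeroing the counter whenever
the TB completes without hitting MMIO, so dcbz page clearing on regular
RAM is not penalised -- and that bookkeeping is exactly where the
complexity goes up.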

I would prefer not to do that just for a microbenchmark, but if you
think it is a reasonable overall win for the average workloads of your
users then perhaps it is worth it.

Thanks,
Nick
