On Wed, 30 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Wed, 30 Apr 2025, Nicholas Piggin wrote:
On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Mon, 28 Apr 2025, Richard Henderson wrote:
On 4/28/25 06:26, BALATON Zoltan wrote:
<snip>
if we've been here before (needing n insn from the base addr) we will
have a cached translation we can re-use. It doesn't stop the longer TB
being called again as we re-enter a loop.
So then maybe it should at least check if there's already a cached TB
where it can continue before calling cpu_io_recompile in io_prepare and
only recompile if needed?
It basically does do that AFAICS. The cpu_io_recompile() name is misleading: it does not cause a recompile, it just updates cflags and exits. The next entry will look up a TB that has just 1 insn and enter that.
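For reference, the relevant part is roughly this (simplified from accel/tcg/translate-all.c; details differ between QEMU versions):

    void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
    {
        /* Find the TB we were executing and sync the CPU state back
         * to the I/O instruction. */
        TranslationBlock *tb = tcg_tb_lookup(retaddr);
        cpu_restore_state_from_tb(cpu, tb, retaddr);

        /* Nothing is retranslated here; we only request that the next
         * tb_lookup() uses cflags asking for a 1-insn TB ending at the
         * I/O access. That TB is generated once and then found in the
         * TB cache on later iterations. */
        cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | 1;
        cpu_loop_exit_noexc(cpu);
    }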
After reading it I came to the same conclusion, but then I don't understand what causes the problem. Is it just that it will exit the loop for every I/O access to look up the recompiled TB? It looks like it tries to chain TBs, so why does that not work here?
Any MMIO access has to come via the slow path. Any MMIO also currently
has to be the last instruction in a block in case the operation triggers
a change in the translation regime that needs to be picked up by the
next instruction you execute.
This is a pathological case when modelling VRAM on a device because it's going to be slow either way. At least if you model the multiple byte access with a helper you can amortise some of the cost of the MMU lookup with a single probe_() call.
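For illustration, a helper shaped roughly like this (only a sketch: the helper name and line size are made up, and the exact probe/store APIs vary between QEMU versions; compare target/ppc/mem_helper.c) pays the translation cost once per line:

    void helper_clear_line(CPUPPCState *env, target_ulong addr)
    {
        const int line_size = 32;   /* assumed cache line size */
        int mmu_idx = cpu_mmu_index(env, false);

        /* One translation lookup covers the whole line... */
        void *haddr = probe_write(env, addr, line_size, mmu_idx, GETPC());

        if (haddr) {
            /* ...and for RAM we get a host pointer back that we can
             * clear directly. */
            memset(haddr, 0, line_size);
        } else {
            /* MMIO (or watchpoints etc.): fall back to one store at
             * a time through the slow path. */
            for (int i = 0; i < line_size; i += 8) {
                cpu_stq_mmuidx_ra(env, addr + i, 0, mmu_idx, GETPC());
            }
        }
    }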
I think there is some mix-up here because of all the different scenarios I benchmarked, so let me try to clear that up. The goal is to find out why access to the VRAM of a graphics card passed through with vfio-pci is slower than expected, when the host should be faster than the mostly embedded or old PPCs used in real machines with only 4x PCIe or PCIe-to-PCI bridges. In this case we are not emulating VRAM but mapping the framebuffer of the real card and accessing that. To find where the slowdown comes from I've benchmarked all the cases upthread, but here are the relevant parts again for easier comparison:
First, both src and dst are in RAM (just malloced buffers, so this is the baseline):
src 0xb79c8008 dst 0xb78c7008
byte loop: 21.16 sec
memset: 3.85 sec
memcpy: 5.07 sec
copyToVRAMNoAltivec: 2.52 sec
copyToVRAMAltivec: 2.42 sec
copyFromVRAMNoAltivec: 6.39 sec
copyFromVRAMAltivec: 7.02 sec
The FromVRAM cases use dcbz to avoid loading into the cache RAM contents that are about to be overwritten on a real machine, so dcbz is never applied to MMIO. (Arguably it should use dcba, but for some reason nobody remembers why it uses dcbz instead.) The ToVRAM cases have dcbt, which is a noop in QEMU. I guess the difference we see here is because of probe_access in dcbz, as was shown by previous profiling. Replacing that with dcba (which is a noop in QEMU) makes ToVRAM and FromVRAM run about the same (you can find that case in the original message). FromVRAM is still a bit slower for some reason, but most of this overhead can be attributed to dcbz.
In the second test dst is mmapped from the emulated ati-vga framebuffer BAR. We can say we emulate VRAM here, but that's just a RAM memory region created in vga.c as:

    memory_region_init_ram_nomigrate(&s->vram, obj, "vga.vram",
                                     s->vram_size, &local_err);

It also has dirty tracking enabled; I don't know if that has any effect. This is shown in the left column here:
dst in emulated ati-vga          | dst in real card vfio vram
mapping 0x80800000               | mapping 0x80800000
src 0xb78e0008 dst 0xb77de000    | src 0xb7ec5008 dst 0xb7dc3000
byte loop: 21.2 sec              | byte loop: 563.98 sec
memset: 3.89 sec                 | memset: 39.25 sec
memcpy: 5.07 sec                 | memcpy: 140.49 sec
copyToVRAMNoAltivec: 2.53 sec    | copyToVRAMNoAltivec: 72.03 sec
copyToVRAMAltivec: 12.22 sec     | copyToVRAMAltivec: 78.12 sec
copyFromVRAMNoAltivec: 6.43 sec  | copyFromVRAMNoAltivec: 728.52 sec
copyFromVRAMAltivec: 35.33 sec   | copyFromVRAMAltivec: 754.95 sec
Here we see that the AltiVec cases have additional overhead, which I think is related to vperm, as that's the only op that does not seem to be compiled to something sensible but calls an unoptimised helper (although that helper is also used for RAM, so I'm not sure why this is slower here). But this shows no other overhead due to MMIO being involved, as the NoAltivec cases are the same as with RAM.
The last case, shown in the right column above, is when instead of ati-vga I have a real ATI card passed through with vfio-pci, which is much slower than can be explained by PCI overhead alone, and I'm trying to find the source of that slowdown.
I've now also run 1000 iterations (vs. 10000 above, so the numbers here are one tenth of those in the right column above) of the last case again (using the real card with vfio-pci) with qemu-system-ppc vs. qemu-system-ppc64 to see if MTTCG has any effect:
1000 iterations qemu-system-ppc  | qemu-system-ppc64
mapping 0x80800000               | mapping 0x80800000
src 0xb7dc6008 dst 0xb7cc4000    | src 0xb78b8008 dst 0xb77b6000
byte loop: 58.44 sec             | byte loop: 57.72 sec
memset: 3.99 sec                 | memset: 3.93 sec
memcpy: 14.43 sec                | memcpy: 14.24 sec
copyToVRAMNoAltivec: 7.27 sec    | copyToVRAMNoAltivec: 7.15 sec
copyToVRAMAltivec: 7.9 sec       | copyToVRAMAltivec: 7.78 sec
copyFromVRAMNoAltivec: 72.68 sec | copyFromVRAMNoAltivec: 72.69 sec
copyFromVRAMAltivec: 75.15 sec   | copyFromVRAMAltivec: 75.05 sec
This does not seem to have much effect, so maybe not having MTTCG does not enable icount but just uses the same functions, which was confusing in the profile.
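If it helps to rule that out explicitly, the thread mode can also be forced on the command line instead of relying on the per-target default, i.e. comparing runs with:

    -accel tcg,thread=single
    -accel tcg,thread=multi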
Finally, I dug up some comparable results from a real machine vs. QEMU. These are with QEMU with the default -cpu 7454 and with -cpu g3 (to check AltiVec overhead, but there seems to be only about a 1% difference):
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2939
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2941
and the same card on a real machine:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2414
It seems for larger rectangles we approach the same limits, but smaller transfers (which I think the VRAM copy also uses) have some big overhead compared to what PCIe communication alone explains.
Another card on QEMU:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2931
and on a real machine:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2372
or a similar card (I did not find exactly the same one) on a real machine with a slower CPU:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/1672
Also, on a real machine using optimised routines does help, so using wider transfers is better than the default unoptimised case.
I was thinking maybe we need a flag or counter to see if cpu_io_recompile is called more than once and, after a limit, invalidate the TB and create two new ones: the first ending at the I/O access, and the second doing what cpu_io_recompile does now. As I understood it that is what Richard suggested, but I don't know how to do that.
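To make the idea concrete, I mean something like this at the end of cpu_io_recompile (entirely hypothetical: the counter field and the limit don't exist, and I don't know if invalidating here is safe):

    /* Hypothetical: count how often this TB hits the I/O slow path. */
    if (++tb->io_recompile_count > IO_RECOMPILE_LIMIT) {
        /* Retire the long TB; its start pc would then have to be
         * retranslated with cflags limited to the first n insns, so
         * we end up with one TB ending at the I/O access and a
         * second TB for the rest. */
        tb_phys_invalidate(tb, -1);
    }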
memset/cpy routines had kind of the same problem with real hardware.
They wanted to use vector instructions for best performance, but when
those are used on MMIO they would trap and be very slow.
Why do those trap on MMIO on a real machine? These routines were tested on real machines, and the reasoning to use the widest possible access was that a PCI transfer has overhead which is minimised by transferring more bits in one op. I think they also verified that it works at least for the 32-bit CPUs up to G4 that were used in real AmigaNG machines. There are some benchmark results here: https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which is also where the benchmark I used comes from, so this should be similar. I think the MemCopy on that page has the plain unoptimised copy as Copy to/from VRAM and optimised routines similar to this benchmark as Read/Write Pixel Array, but it's not easy to search. Some of the machines like the Pegasos II and AmigaOne XE were made with both G3 and G4 CPUs, so if I find a result from those with the same graphics card that could show whether AltiVec is faster (although the G4s were also clocked higher, so not directly comparable). Some results there are also from QEMU, mostly those with the SiliconMotion 502, but that does not have this problem; only vfio-pci passthrough does.
They don't - what we need is to have a RAM-like-device model for QEMU where we can relax the translation rules, because we know we are writing to RAM-like things that don't have registers or other state-changing behaviour.
The poor behaviour is because QEMU currently treats all MMIO as potentially system-state altering, whereas for VRAM it doesn't need to.
This does not seem to be the case with emulated ati-vga, and with vfio-pci it should also be mapped memory from the graphics card, which technically is MMIO, but how does QEMU decide that when it does not seem to consider ati-vga as I/O? Typically in QEMU MMIO is an I/O memory region that goes through memops, and that's understandably slow, but here we should be reading/writing mapped memory space. Maybe I should try to find out what vfio-pci actually does here, but it is used for gaming with KVM and there people get near-native performance, so I don't think the overhead is in vfio-pci itself.
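For what it's worth, my understanding from a quick look at hw/vfio is that the mmap()able part of a BAR is wrapped as a "RAM device" region, roughly (paraphrased, not the exact code):

    /* Paraphrased from vfio_region_mmap() in hw/vfio: */
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     vbasedev->fd, region->fd_offset + offset);
    memory_region_init_ram_device_ptr(mr, owner, name, size, ptr);

Such ram_device regions are read and written through the ram_device_mem_ops callbacks in memory.c on at least some paths (to respect the device's access size constraints), so whether the TCG fast path applies to them would be the first thing I'd check.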
So I could explain some small overheads with dcbz and maybe vperm, but the biggest one seems to only happen when accessing the real card's VRAM with vfio-pci; it does not seem to happen on a real machine and I could not reproduce it with emulated ati-vga either. That's all I could find out so far, and I still don't get where the biggest overhead comes from.
Regards,
BALATON Zoltan
So maybe it's something
with how vfio-pci maps PCI memory BARs?
I don't know about vfio-pci but blob resources mapped via virtio-gpu
just appear as chunks of RAM to the guest - hence no trapping.
Problem is we don't know ahead of time if some routine will access
MMIO or not. You could recompile it with fewer instructions but then
it will be slow when used for regular memory.
Heuristics are tough because you could have e.g. one initial big memset to clear an MMIO region that iterates many times over an inner loop of dcbz instructions, but then is never used again for MMIO while being important for regular page clearing. You could make something that dynamically decays or periodically recompiles back to the non-I/O case, perhaps, but then complexity goes up.
We can't have heuristics when we must prioritise correctness. However we
could expand the device model to make the exact behaviour of different
devices clear and optimise when we know it is safe.
I would prefer not to do that just for a microbenchmark, but if you think it is a reasonable overall win for average workloads of your users then perhaps.
I'm still trying to understand what to optimise. So far it looks like dcbz has the least impact, then vperm a bit bigger but still only a few percent, and the biggest impact is still not known for sure. We see faster access on real machines that run on slower PCIe (only 4x at best), while CPU benchmarks don't show slower performance on QEMU; only accessing the passed-through card's VRAM is slower than expected. If there's a trap involved, I've found before that exceptions are slower with QEMU, but I did not see evidence of that in the profile.
Regards,
BALATON Zoltan