On Fri, 16 May 2025 11:30:49 +0900 Itaru Kitayama <itaru.kitay...@linux.dev> wrote:
> Hi Jonathan,
> 
> > On May 13, 2025, at 20:14, Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> > 
> > V13:
> > - Make CXL fixed memory windows sysbus devices.
> >   IIRC this was requested by Peter in one of the reviews a long time back,
> >   but at the time the motivation was less strong than it has become with
> >   some WiP patches for hotness monitoring and high-performance direct
> >   connect, where we need a machine-type-independent way to iterate all the
> >   CXL fixed memory windows. This is a convenient place to do it, so drag
> >   that work forward into this series.
> > 
> >   This allows us to drop the separate list and the machine-specific
> >   access code it needed, in favour of
> >   object_child_foreach_recursive(object_get_root(), ...).
> >   One snag is that the ordering of multiple fixed memory windows in that
> >   walk depends on the underlying g_hash_table iteration order rather than
> >   the order of creation. In the memory map layout and ACPI table creation
> >   we need both stable and predictable ordering. Resolve this in a similar
> >   fashion to object_class_get_list_sorted() by throwing them into a GSList
> >   and sorting that. Only use this when a sorted list is needed.
> > 
> > Dropped RFC as I'm now happy with this code and would like to get it
> > upstream! Particularly as it's broken even today due to emscripten-related
> > changes that stop us using g_slist_sort(). Easy fix though.
> > 
> > Note that we have an issue for CXL emulation in general with TCG, which
> > is being discussed in:
> > https://lore.kernel.org/all/20250425183524.00000...@huawei.com/
> > (also affects some other platforms)
> > 
> > Until that is resolved, either rebase this back on 10.0 or just
> > don't let code run out of it (don't use KMEM to expose it as normal
> > memory, use DAX instead).
> > 
> > Previous cover letter:
> > 
> > Back in 2022, this series stalled on the absence of a solution to device
> > tree support for PCI Expander Bridges (PXB) and we ended up only having
> > x86 support upstream. I've been carrying the arm64 support out of tree
> > since then, with occasional nasty surprises (e.g. UNIMP + DT issue seen
> > a few weeks ago) and a fair number of fiddly rebases.
> > gitlab.com/jic23/qemu cxl-<latest date>
> > 
> > A recent discussion with Peter Maydell indicated that there are various
> > other ACPI-only features now, so in general he might be more relaxed
> > about DT support being necessary. The upcoming vSMMUv3 support would
> > run into this problem as well.
> > 
> > I presented the background to the PXB issue at Linaro Connect 2022. In
> > short, the issue is that PXBs steal MMIO space from the main PCI root
> > bridge. The challenge is knowing how much to steal.
> > 
> > On ACPI platforms, we can rely on EDK2 to perform an enumeration and
> > configuration of the PCI topology, and QEMU can update the ACPI tables
> > after EDK2 has done this, when it can simply read the space used by the
> > root ports. On device tree, there is no entity to figure out that
> > enumeration, so we don't know how to size the stolen region.
> > 
> > Three approaches were discussed:
> > 1) Enumerating in QEMU. Horribly complex, and the last thing we want is a
> >    3rd enumeration implementation that ends up out of sync with EDK2 and
> >    the kernel (there are frequent issues because of how those existing
> >    implementations differ).
> > 2) Figure out how to enumerate in the kernel.
> >    I never put a huge amount of work into this, but it seemed likely to
> >    involve a nasty dance with very specific code similar to what EDK2 is
> >    carrying, and it would be very challenging to upstream (given the lack
> >    of clarity on real use cases for PXBs and DT).
> > 3) Hack it based on the control we have, which is bus numbers.
> >    No one liked this but it worked :)
> > 
> > The other little wrinkle would be the need to define full bindings for CXL
> > on DT + implement a fairly complex kernel stack, as the equivalent in ACPI
> > involves a static table (CEDT), new runtime queries via _DSM and a
> > description of various components. Doable, but so far there is no interest
> > on physical platforms. Worth noting that for now, the QEMU CXL emulation is
> > all about testing and developing the OS stack, not about virtualization
> > (performance is terrible except in some very contrived situations!)
> > 
> > Back to posting as an RFC because there was some discussion of the approach
> > to modelling the devices that may need a bit of redesign.
> > The discussion kind of died out on the back of the DT issue and I doubt
> > anyone can remember the details.
> > 
> > https://lore.kernel.org/qemu-devel/20220616141950.23374-1-jonathan.came...@huawei.com/
> > 
> > There is only a very simple test in here, because my intent is not to
> > duplicate what we have on x86, but just to do a smoke test that everything
> > is hooked up. In general we need much more comprehensive end-to-end CXL
> > tests, but that requires a reasonably stable guest software stack. A few
> > people have expressed interest in working on that, but we aren't there yet.
> > 
> > Note that this series has a very different use case to that in the proposed
> > SBSA-ref support:
> > https://lore.kernel.org/qemu-devel/20250117034343.26356-1-wangyuquan1...@phytium.com.cn/
> > 
> > SBSA-ref is a good choice if you want a relatively simple, mostly fixed
> > configuration. That works well with the limited host system
> > discoverability etc. as EDK2 can be built against a known configuration.
> > 
> > My interest with this support in arm/virt is to support host software stack
> > development (we have a wide range of contributors, most of whom are working
> > on emulation + the kernel support). I care about the weird corners. As such
> > I need to be able to bring up variable numbers of host bridges, multiple CXL
> > Fixed Memory Windows with varying characteristics (interleave etc), complex
> > NUMA topologies with weird performance characteristics etc. We can do that
> > on x86 upstream today, or from my gitlab tree. Note that we need arm support
> > for some arch-specific features in the near future (cache flushing).
> > Doing kernel development with this need for flexibility on SBSA-ref is not
> > currently practical. SBSA-ref CXL support is an excellent thing, just
> > not much use to me for this work.
> > 
> > Jonathan Cameron (5):
> >   hw/cxl-host: Add an index field to CXLFixedMemoryWindow
> >   hw/cxl: Make the CXL fixed memory windows devices.
> >   hw/cxl-host: Allow split of establishing memory address and mmio setup.
> >   hw/arm/virt: Basic CXL enablement on pci_expander_bridge instances pxb-cxl
> >   qtest/cxl: Add aarch64 virt test for CXL
> > 
> >  include/hw/arm/virt.h     |   4 +
> >  include/hw/cxl/cxl.h      |   4 +
> >  include/hw/cxl/cxl_host.h |   6 +-
> >  hw/acpi/cxl.c             |  83 +++++++------
> >  hw/arm/virt-acpi-build.c  |  34 ++++++
> >  hw/arm/virt.c             |  29 +++++
> >  hw/cxl/cxl-host-stubs.c   |   8 +-
> >  hw/cxl/cxl-host.c         | 218 ++++++++++++++++++++++++++++++++------
> >  hw/i386/pc.c              |  51 ++++-----
> >  tests/qtest/cxl-test.c    |  59 ++++++++---
> >  tests/qtest/meson.build   |   1 +
> >  11 files changed, 389 insertions(+), 108 deletions(-)
> > 
> > -- 
> > 2.43.0
> 
> With your series applied on top of upstream QEMU, the -drive option does
> not work well with the sane CXL setup (I use run_qemu.sh maintained by
> Marc et al. at Intel); see below:
> 
> /home/realm/projects/qemu/build/qemu-system-aarch64
>   -machine virt,accel=tcg,cxl=on,highmem=on,compact-highmem=on,highmem-ecam=on,highmem-mmio=on
>   -m 2048M,slots=0,maxmem=6144M
>   -smp 2,sockets=1,cores=2,threads=1
>   -display none -nographic
>   -drive if=pflash,format=raw,unit=0,file=AAVMF_CODE.fd,readonly=on
>   -drive if=pflash,format=raw,unit=1,file=AAVMF_VARS.fd
>   -drive file=root.img,format=raw,media=disk
>   -kernel mkosi.extra/boot/vmlinuz-6.15.0-rc4-00040-g128ad8fa385b
>   -initrd mkosi.extra/boot/initramfs-6.15.0-rc4-00040-g128ad8fa385b.img
>   -append selinux=0 audit=0 console=tty0 console=ttyS0
>     root=PARTUUID=14d6bae9-c917-435d-89ea-99af1fa4439a ignore_loglevel rw
>     initcall_debug log_buf_len=20M memory_hotplug.memmap_on_memory=force
>     cxl_acpi.dyndbg=+fplm cxl_pci.dyndbg=+fplm cxl_core.dyndbg=+fplm
>     cxl_mem.dyndbg=+fplm cxl_pmem.dyndbg=+fplm cxl_port.dyndbg=+fplm
>     cxl_region.dyndbg=+fplm cxl_test.dyndbg=+fplm cxl_mock.dyndbg=+fplm
>     cxl_mock_mem.dyndbg=+fplm systemd.set_credential=agetty.autologin:root
>     systemd.set_credential=login.noauth:yes
>   -device e1000,netdev=net0,mac=52:54:00:12:34:56
>   -netdev user,id=net0,hostfwd=tcp::10022-:22
>   -cpu max
>   -object memory-backend-file,id=cxl-mem0,share=on,mem-path=cxltest0.raw,size=256M
>   -object memory-backend-file,id=cxl-mem1,share=on,mem-path=cxltest1.raw,size=256M
>   -object memory-backend-file,id=cxl-mem2,share=on,mem-path=cxltest2.raw,size=256M
>   -object memory-backend-file,id=cxl-mem3,share=on,mem-path=cxltest3.raw,size=256M
>   -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=lsa0.raw,size=128K
>   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=lsa1.raw,size=128K
>   -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=lsa2.raw,size=128K
>   -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=lsa3.raw,size=128K
>   -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=53
>   -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191
>   -device cxl-rp,id=hb0rp0,bus=cxl.0,chassis=0,slot=0,port=0
>   -device cxl-rp,id=hb0rp1,bus=cxl.0,chassis=0,slot=1,port=1
>   -device cxl-rp,id=hb1rp0,bus=cxl.1,chassis=0,slot=2,port=0
>   -device cxl-rp,id=hb1rp1,bus=cxl.1,chassis=0,slot=3,port=1
>   -device cxl-upstream,port=4,bus=hb0rp0,id=cxl-up0,multifunction=on,addr=0.0,sn=12345678
>   -device cxl-switch-mailbox-cci,bus=hb0rp0,addr=0.1,target=cxl-up0
>   -device cxl-upstream,port=4,bus=hb1rp0,id=cxl-up1,multifunction=on,addr=0.0,sn=12341234
>   -device cxl-switch-mailbox-cci,bus=hb1rp0,addr=0.1,target=cxl-up1
>   -device cxl-downstream,port=0,bus=cxl-up0,id=swport0,chassis=0,slot=4
>   -device cxl-downstream,port=1,bus=cxl-up0,id=swport1,chassis=0,slot=5
>   -device cxl-downstream,port=2,bus=cxl-up0,id=swport2,chassis=0,slot=6
>   -device cxl-downstream,port=3,bus=cxl-up0,id=swport3,chassis=0,slot=7
>   -device cxl-downstream,port=0,bus=cxl-up1,id=swport4,chassis=0,slot=8
>   -device cxl-downstream,port=1,bus=cxl-up1,id=swport5,chassis=0,slot=9
>   -device cxl-downstream,port=2,bus=cxl-up1,id=swport6,chassis=0,slot=10
>   -device cxl-downstream,port=3,bus=cxl-up1,id=swport7,chassis=0,slot=11
>   -device cxl-type3,bus=swport0,persistent-memdev=cxl-mem0,id=cxl-pmem0,lsa=cxl-lsa0
>   -device cxl-type3,bus=swport2,persistent-memdev=cxl-mem1,id=cxl-pmem1,lsa=cxl-lsa1
>   -device cxl-type3,bus=swport4,volatile-memdev=cxl-mem2,id=cxl-vmem2,lsa=cxl-lsa2
>   -device cxl-type3,bus=swport6,volatile-memdev=cxl-mem3,id=cxl-vmem3,lsa=cxl-lsa3
>   -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k,cxl-fmw.1.targets.0=cxl.0,cxl-fmw.1.targets.1=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.1.interleave-granularity=8k
>   -snapshot
>   -object memory-backend-ram,id=mem0,size=2048M
>   -numa node,nodeid=0,memdev=mem0,
>   -numa cpu,node-id=0,socket-id=0
>   -numa dist,src=0,dst=0,val=10
> 
> qemu-system-aarch64: -drive file=root.img,format=raw,media=disk: PCI: Only
> PCI/PCIe bridges can be plugged into pxb-cxl
> 
> Plain upstream QEMU aarch64 target virt machine can handle the -drive
> option without an issue _without_ those cxl setup options added. I think
> the error was seen with your previous cxl-2025-03-20 branch.

I think I understand what is going on here. This is a challenge to fix on
the QEMU side because it relies on a bunch of old behavior. For a while
now, legacy -drive parameters have been split into the parts that go in
-drive and a separate -device entry that covers the hardware. Even with
that, though, we'd need to specify a bus, as on arm64 the default for
-drive is virtio-blk, which ends up connected on the pcie bus. Note this
is just as broken if you use any pxb-pcie instances, as those have a
similar fixed assumption that you can only hang root ports off them.

I think the best solution is probably to stop using the old-style

  -drive file=root.img,format=raw,media=disk

and switch to a fully expanded virtio solution along the lines of

  -drive if=none,file=root.img,format=raw,media=disk,id=hd \
  -device virtio-blk,bus=pcie.0,drive=hd

which should be fine on x86 and arm64 (and any other architectures we
support in the future) and will always bring the disk up as an RCiEP on
the main pcie root complex.

Jonathan

> Thanks,
> Itaru.
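
For clarity, the change Jonathan suggests relative to the invocation above
amounts to replacing the single legacy option with an explicit backend plus
device pair; this is only a sketch, the rest of the command line stays
unchanged:

    # before: legacy shorthand; the auto-created virtio-blk lands on a bus it
    # cannot attach to in this topology (hence the pxb-cxl error above)
    -drive file=root.img,format=raw,media=disk

    # after: explicit backend plus virtio-blk pinned to the primary root complex
    -drive if=none,file=root.img,format=raw,media=disk,id=hd \
    -device virtio-blk,bus=pcie.0,drive=hd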
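
For readers less familiar with QOM, the cover letter above describes replacing
the machine-specific list of fixed memory windows with an
object_child_foreach_recursive() walk, sorted into a GSList where stable
ordering is needed. A minimal sketch of that pattern follows; the
TYPE_CXL_FMW / CXL_FMW / CXLFixedWindow names and the index field are
assumptions based on patches 1-2, not the exact code in the series:

    /* Illustrative sketch, not the series' actual implementation. */
    static int cxl_fmw_collect(Object *child, void *opaque)
    {
        GSList **list = opaque;

        /* Assumed QOM type name for the new sysbus fixed memory window device */
        if (object_dynamic_cast(child, TYPE_CXL_FMW)) {
            *list = g_slist_prepend(*list, child);
        }
        return 0;
    }

    static gint cxl_fmw_cmp(gconstpointer a, gconstpointer b)
    {
        const CXLFixedWindow *fa = CXL_FMW((Object *)a);
        const CXLFixedWindow *fb = CXL_FMW((Object *)b);

        /* Sort by the index assigned at creation so the memory map layout and
         * ACPI table generation see a stable, predictable order. */
        return fa->index - fb->index;
    }

    static GSList *cxl_fmw_get_all_sorted(void)
    {
        GSList *list = NULL;

        object_child_foreach_recursive(object_get_root(), cxl_fmw_collect, &list);
        return g_slist_sort(list, cxl_fmw_cmp);
    }

If g_slist_sort() is a problem (the emscripten issue mentioned in the cover
letter), building the list with g_slist_insert_sorted() during the walk would
be one obvious alternative.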