On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:
> On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> <gregory.pr...@memverge.com> wrote:
> >
> > If you drop this line:
> >
> > -numa node,memdev=vmem0,nodeid=1
>
> We tried this as well and it works after going through the cxlcli
> process and created the devdax device. The problem is that without the
> "nodeid=1" configuration, we cannot connect with the explicit per numa
> node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
> code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
> lb information to VM's hmat.
>

Yeah, this is what Jonathan was saying - right now there isn't a good
way (in QEMU) to pass the hmat/cdat stuff down through the device.
Needs to be plumbed out.

In the meantime: You should just straight up drop the cxl device from
your QEMU config. It doesn't actually get you anything.

> From what I understand so far, the guest kernel will dynamically
> create a numa node after a cxl devdax device is created. That means we
> don't know the numa node until after VM boot. 2. QEMU can only
> statically parse the lb information to the VM at boot time. How do we
> connect these two things?

During boot, the kernel discovers all the memory regions exposed to
the BIOS. In this QEMU configuration you have defined:

region 0: CPU + DRAM node
region 1: DRAM only node
region 2: CXL Fixed Memory Window (the last line of the cxl stuff)

The kernel reads this information on boot and reserves 1 numa node for
each of these regions.

The kernel then automatically brings up regions 0 and 1 in nodes 0 and
1 respectively.

Node 2 sits dormant until you go through the cxl-cli startup sequence.

What you're asking for is for the QEMU team to plumb hmat/cdat
information down through the type3 device. I *think* that is presently
possible with a custom CDAT file - but Jonathan probably has more
details on that. You'll have to go digging for answers on that one.

Now - even if you did that - the current state of the cxl-type3 device
is *not what you want* because your memory accesses will be routed
through the read/write functions in the emulated device.

What Jonathan and I discussed on the other thread is how you might go
about slimming this down to allow pass-through of the memory without
the need for all the fluff. This is a non-trivial refactor of the
existing device, so I would not expect that any time soon.

At the end of the day, the quickest way to get-there-from-here is to
just drop the cxl related lines from your QEMU config, and keep
everything else.

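As a rough sketch of what "keep everything else" could look like: a
two-node config with no cxl lines, where node 1 is a CPU-less DRAM node
standing in for the CXL memory and gets its own "-numa hmat-lb"
entries. Only "vmem0" and "nodeid=1" come from this thread. The machine
type, "mem0", the sizes, the CPU count and the latency/bandwidth
numbers are all made up for illustration, and the rest of your options
(disk, network, kernel) are left out.

```shell
# Illustrative values only: sizes, ids other than vmem0, and all
# latency/bandwidth numbers are assumptions, not a measured setup.
# Node 0: CPUs + DRAM. Node 1: memory-only node (initiator is node 0).
qemu-system-x86_64 \
  -machine q35,hmat=on \
  -m 4G \
  -smp 4 \
  -object memory-backend-ram,id=mem0,size=2G \
  -object memory-backend-ram,id=vmem0,size=2G \
  -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
  -numa node,nodeid=1,memdev=vmem0,initiator=0 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M
```

With something along these lines, both nodes should be online at boot
(numactl -H in the guest is a quick check), with no cxl-cli step.
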
>
> Assuming that the same issue applies to a physical server with CXL.
> Were you able to see a host kernel getting the correct lb information
> for a CXL devdax device?
>

Yes, if you bring up a CXL device via cxl-cli on real hardware, the
subsequent numa node ends up in the "lower tier" of the memory-tiering
infrastructure.

~Gregory