On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.pr...@memverge.com> wrote:
>
> On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.pr...@memverge.com>
> > wrote:
> > >
> > > For a variety of performance reasons, this will not work the way you
> > > want it to. You are essentially telling QEMU to map the vmem0 into a
> > > virtual cxl device, and now any memory accesses to that memory region
> > > will end up going through the cxl-type3 device logic - which is an IO
> > > path from the perspective of QEMU.
> >
> > I didn't understand exactly how the virtual cxl-type3 device works. I
> > thought it would go with the same "guest virtual address -> guest
> > physical address -> host physical address" translation totally done by
> > CPU. But if it is going through an emulation path handled by virtual
> > cxl-type3, I agree the performance would be bad. Do you know why
> > accessing memory on a virtual cxl-type3 device can't go with the
> > nested page table translation?
> >
>
> Because a byte-access on CXL memory can have checks on it that must be
> emulated by the virtual device, and because there are caching
> implications that have to be emulated as well.
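For context, the kind of configuration under discussion (a host memory
backend mapped into an emulated cxl-type3 device) looks roughly like the
sketch below. The ids (vmem0, cxl.1, root_port0) and sizes are
illustrative, not taken from this thread:

```shell
# Sketch of an emulated CXL type-3 volatile device in QEMU (ids/sizes
# are illustrative). Guest accesses to this region go through the
# cxl-type3 device model, i.e. an I/O path, not plain nested paging.
qemu-system-x86_64 \
  -machine q35,cxl=on \
  -object memory-backend-ram,id=vmem0,size=4G \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=2 \
  -device cxl-type3,bus=root_port0,volatile-memdev=vmem0,id=cxl-vmem0 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
```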
Interesting. Now I see the cxl_type3_read/cxl_type3_write handlers. If
the CXL memory data path goes through them, the performance would be
pretty problematic. We have actually run Intel's Memory Latency Checker
benchmark from inside a guest VM with both system DRAM and a virtual
CXL type-3 device configured. The idle latency on the virtual CXL memory
is 2X that of system DRAM, which is on par with the benchmark running on
a physical host. I need to debug this more to understand why the latency
is actually much better than I would now expect.

> The cxl device you are using is an emulated CXL device - not a
> virtualization interface. Nuanced difference: the emulated device has
> to emulate *everything* that CXL device does.
>
> What you want is passthrough / managed access to a real device -
> virtualization. This is not the way to accomplish that. A better way
> to accomplish that is to simply pass the memory through as a static numa
> node as I described.

That would work, too. But I think a kernel change is required to
establish the correct memory tiering if we go this route.

> > When we had a discussion with Intel, they told us to not use the KVM
> > option in QEMU while using virtual cxl type3 device. That's probably
> > related to the issue you described here? We enabled KVM though but
> > haven't seen the crash yet.
> >
>
> The crash really only happens, IIRC, if code ends up hosted in that
> memory. I forget the exact scenario, but the working theory is it has
> to do with the way instruction caches are managed with KVM and this
> device.
>
> > >
> > > You're better off just using the `host-nodes` field of host-memory
> > > and passing bandwidth/latency attributes though via `-numa hmat-lb`
> >
> > We tried this but it doesn't work from end to end right now. I
> > described the issue in another fork of this thread.
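The `-numa hmat-lb` alternative mentioned above could be expressed
roughly as follows; the node ids, the host-nodes binding, and the
latency/bandwidth figures are illustrative assumptions:

```shell
# Sketch: expose host memory (e.g. from host node 2) as a plain guest
# NUMA node carrying HMAT latency/bandwidth attributes, so the guest
# needs no CXL awareness at all. All ids and numbers are illustrative;
# latency values are in nanoseconds.
qemu-system-x86_64 \
  -machine q35,hmat=on \
  -object memory-backend-ram,id=m0,size=4G \
  -object memory-backend-ram,id=m1,size=4G,host-nodes=2,policy=bind \
  -numa node,nodeid=0,memdev=m0,cpus=0-3 \
  -numa node,nodeid=1,memdev=m1 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=80 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=100G \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=160 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=50G
```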
> > >
> > > In that scenario, the guest software doesn't even need to know CXL
> > > exists at all, it can just read the attributes of the numa node
> > > that QEMU created for it.
> >
> > We thought about this before. But the current kernel implementation
> > requires a devdax device to be probed and recognized as a slow tier
> > (by reading the memory attributes). I don't think this can be done
> > via the path you described. Have you tried this before?
> >
>
> Right, because the memory tiering component lumps the nodes together.
>
> Better idea: Fix the memory tiering component
>
> I cc'd you on another patch line that is discussing something relevant
> to this.
>
> https://lore.kernel.org/linux-mm/87fs00njft....@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
>
> The point is: There's no need for this to be a dax device at all, there
> is no need for the guest to even know what is providing the memory, or
> for the guest to have any management access to the memory. It just
> wants the memory and the ability to tier it.
>
> So we should fix the memory tiering component to work with this
> workflow.

Agreed. We really don't need the devdax device at all. I thought that
choice was made because the memory tiering concept started with pmem ...
Let's continue this part of the discussion on the above thread.

> ~Gregory
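On the guest side, the node attributes the tiering code would consume
are already visible through the kernel's HMAT reporting in sysfs. A
rough sketch of reading them (the helper name is ours, and the accessN
directories may be absent on nodes without HMAT data):

```python
from pathlib import Path

def node_hmat_attrs(node: int, access: int = 0) -> dict:
    """Read the HMAT-derived latency/bandwidth attributes the kernel
    exposes for a NUMA node (latency in ns, bandwidth in MB/s).
    Returns an empty dict if the node has no such attributes."""
    base = Path(f"/sys/devices/system/node/node{node}"
                f"/access{access}/initiators")
    attrs = {}
    for name in ("read_latency", "write_latency",
                 "read_bandwidth", "write_bandwidth"):
        path = base / name
        if path.exists():
            attrs[name] = int(path.read_text())
    return attrs

print(node_hmat_attrs(0))
```

This is all the information a tiering policy needs to rank nodes, which
is the sense in which no devdax device is required.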