On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.pr...@memverge.com> wrote:
>
> On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.pr...@memverge.com> 
> > wrote:
> > >
> > > For a variety of performance reasons, this will not work the way you
> > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > virtual cxl device, and now any memory accesses to that memory region
> > > will end up going through the cxl-type3 device logic - which is an IO
> > > path from the perspective of QEMU.
> >
> > I didn't understand exactly how the virtual cxl-type3 device works. I
> > thought it would go with the same "guest virtual address ->  guest
> > physical address -> host physical address" translation totally done by
> > CPU. But if it is going through an emulation path handled by virtual
> > cxl-type3, I agree the performance would be bad. Do you know why
> > accessing memory on a virtual cxl-type3 device can't go with the
> > nested page table translation?
> >
>
> Because a byte-access on CXL memory can have checks on it that must be
> emulated by the virtual device, and because there are caching
> implications that have to be emulated as well.

Interesting. Now I see cxl_type3_read/cxl_type3_write. If the CXL
memory data path goes through them, the performance would be pretty
problematic. We have actually run Intel's Memory Latency Checker
benchmark from inside a guest VM with both system DRAM and virtual
CXL type3 memory configured. The idle latency on the virtual CXL
memory is 2X that of system DRAM, which is on par with the same
benchmark run on a physical host. I need to debug this more to
understand why the latency is so much better than I would now expect.
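
For reference, the measurement was along these lines (a sketch from
memory - exact MLC flags and node numbering may differ; node 1 is
assumed to be the virtual CXL node):

    # idle latency with the benchmark buffer bound to the CXL-backed node
    numactl --membind=1 ./mlc --idle_latency
    # baseline with the buffer bound to the local DRAM node
    numactl --membind=0 ./mlc --idle_latency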

>
> The cxl device you are using is an emulated CXL device - not a
> virtualization interface.  Nuanced difference:  the emulated device has
> to emulate *everything* that CXL device does.
>
> What you want is passthrough / managed access to a real device -
> virtualization.  This is not the way to accomplish that.  A better way
> to accomplish that is to simply pass the memory through as a static numa
> node as I described.

That would work, too. But I think a kernel change is required to
establish the correct memory tiering if we go this route.
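
My understanding of that setup, as a rough sketch (the backend ids,
sizes and host node numbers below are placeholders, not our actual
configuration):

    # expose the CXL-backed host node as a plain, CPU-less guest NUMA node
    qemu-system-x86_64 -machine q35 -m 8G -smp 4 \
      -object memory-backend-ram,id=mem0,size=4G,host-nodes=0,policy=bind \
      -object memory-backend-ram,id=mem1,size=4G,host-nodes=1,policy=bind \
      -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
      -numa node,nodeid=1,memdev=mem1

The guest then just sees a memory-only node 1; nothing CXL-specific is
exposed, which is the point.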

>
> >
> > When we had a discussion with Intel, they told us to not use the KVM
> > option in QEMU while using virtual cxl type3 device. That's probably
> > related to the issue you described here? We enabled KVM though but
> > haven't seen the crash yet.
> >
>
> The crash really only happens, IIRC, if code ends up hosted in that
> memory.  I forget the exact scenario, but the working theory is it has
> to do with the way instruction caches are managed with KVM and this
> device.
>
> > >
> > > You're better off just using the `host-nodes` field of host-memory
> > > and passing bandwidth/latency attributes through `-numa hmat-lb`
> >
> > We tried this but it doesn't work from end to end right now. I
> > described the issue in another fork of this thread.
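
To be concrete, the kind of invocation we tried is roughly the
following (node IDs and the latency/bandwidth numbers here are
illustrative, not our exact values):

    qemu-system-x86_64 -machine q35,hmat=on -m 8G -smp 4 \
      -object memory-backend-ram,id=mem0,size=4G \
      -object memory-backend-ram,id=mem1,size=4G,host-nodes=1,policy=bind \
      -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
      -numa node,nodeid=1,memdev=mem1,initiator=0 \
      -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=100 \
      -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200G \
      -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=300 \
      -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40G

The piece that doesn't work end to end, as mentioned below, is that the
guest kernel's tiering code does not treat such a node as a slow tier
without a devdax device.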
> >
> > >
> > > In that scenario, the guest software doesn't even need to know CXL
> > > exists at all, it can just read the attributes of the numa node
> > > that QEMU created for it.
> >
> > We thought about this before. But the current kernel implementation
> > requires a devdax device to be probed and recognized as a slow tier
> > (by reading the memory attributes). I don't think this can be done via
> > the path you described. Have you tried this before?
> >
>
> Right, because the memory tiering component lumps the nodes together.
>
> Better idea:  Fix the memory tiering component
>
> I cc'd you on another patch thread that is discussing something relevant
> to this.
>
> https://lore.kernel.org/linux-mm/87fs00njft....@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
>
> The point is: There's no need for this to be a dax device at all, there
> is no need for the guest to even know what is providing the memory, or
> for the guest to have any management access to the memory.  It just
> wants the memory and the ability to tier it.
>
> So we should fix the memory tiering component to work with this
> workflow.

Agreed. We really don't need the devdax device at all. I thought that
choice was made because the memory tiering work started with
pmem ... Let's continue this part of the discussion on the above
thread.
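
For what it's worth, once the node's attributes come in via hmat-lb,
the guest can already read them from sysfs with no dax device involved,
e.g. (assuming node 1 is the slow, memory-only node):

    # access characteristics of node 1's memory from its best initiators
    cat /sys/devices/system/node/node1/access0/initiators/read_latency
    cat /sys/devices/system/node/node1/access0/initiators/read_bandwidth

So the tiering code should, in principle, already have everything it
needs to classify the node without a devdax device.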

>
> ~Gregory
