On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.pr...@memverge.com> wrote:
>
> On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> > On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xi...@bytedance.com> wrote:
> > >
> > > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.pr...@memverge.com> wrote:
> > > >
> > > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.pr...@memverge.com> wrote:
> > > > > >
> > > > > > For a variety of performance reasons, this will not work the way you
> > > > > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > > path from the perspective of QEMU.
> > > > >
> > > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > > thought it would use the same "guest virtual address -> guest physical
> > > > > address -> host physical address" translation done entirely by the CPU.
> > > > > But if it is going through an emulation path handled by the virtual
> > > > > cxl-type3 device, I agree the performance would be bad. Do you know why
> > > > > accessing memory on a virtual cxl-type3 device can't use nested page
> > > > > table translation?
> > > > >
> > > >
> > > > Because a byte-access on CXL memory can have checks on it that must be
> > > > emulated by the virtual device, and because there are caching
> > > > implications that have to be emulated as well.
> > >
> > > Interesting. I see cxl_type3_read/cxl_type3_write now. If the CXL
> > > memory data path goes through them, the performance would be pretty
> > > problematic. We have actually run Intel's Memory Latency Checker
> > > benchmark from inside a guest VM with both system DRAM and virtual
> > > CXL-type3 memory configured. The idle latency on the virtual CXL
> > > memory is 2X that of system DRAM, which is on par with the benchmark
> > > running on a physical host. I need to debug this more to understand
> > > why the latency is so much better than I would expect.
> >
> > So we double-checked our benchmark results. What we see is that running
> > Intel Memory Latency Checker from a guest VM with virtual CXL memory
> > and from a physical host with a CXL 1.1 memory expander gives the same
> > latency.
> >
> > From the guest VM: local-socket system-DRAM latency is 117.0ns,
> > local-socket CXL-DRAM latency is 269.4ns.
> > From the physical host: local-socket system-DRAM latency is 113.6ns,
> > local-socket CXL-DRAM latency is 267.5ns.
> >
> > I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> > while running the benchmark, but those two functions are never hit.
> > We used the virtual CXL configuration while launching QEMU, but the
> > CXL memory is presented as a separate NUMA node and we are not
> > creating devdax devices. Does that make any difference?
> >
>
> Could you possibly share your full QEMU configuration and what OS/kernel
> you are running inside the guest?

Sounds like the technical details are explained in the other thread.
From what I understand now, if we don't go through the full CXL setup,
memory accesses don't go through the emulation path.
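
For reference, this is roughly how we checked from the host that the
emulated MMIO path is not exercised (a sketch from memory; it assumes
the QEMU binary is qemu-system-x86_64 and was built with debug symbols):

# Attach to the running QEMU and break on the emulated read/write handlers
gdb -p "$(pidof qemu-system-x86_64)" \
    -ex 'break cxl_type3_read' \
    -ex 'break cxl_type3_write' \
    -ex 'continue'

Neither breakpoint fires while the benchmark runs against the CXL-backed
NUMA node.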

Here is our exact setup. The guest runs Linux kernel 6.6-rc2:

taskset --cpu-list 0-47,96-143 \
numactl -N 0 -m 0 ${QEMU} \
-M q35,cxl=on,hmat=on \
-m 64G \
-smp 8,sockets=1,cores=8,threads=1 \
-object memory-backend-ram,id=ram0,size=45G \
-numa node,memdev=ram0,cpus=0-7,nodeid=0 \
-msg timestamp=on -L /usr/share/seabios \
-enable-kvm \
-object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
-numa node,memdev=vmem0,nodeid=1 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=14 \
-numa dist,src=1,dst=0,val=14 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-bandwidth,bandwidth=262100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-bandwidth,bandwidth=30000M \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-bandwidth,bandwidth=176100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-bandwidth,bandwidth=30000M \
-drive file="${DISK_IMG}",format=qcow2 \
-device pci-bridge,chassis_nr=3,id=pci.3,bus=pcie.0,addr=0xd \
-netdev tap,id=vm-sk-tap22,ifname=tap22,script=/usr/local/etc/qemu-ifup,downscript=no \
-device virtio-net-pci,netdev=vm-sk-tap22,id=net0,mac=02:11:17:01:7e:33,bus=pci.3,addr=0x1,bootindex=3 \
-serial mon:stdio
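
For completeness, this is roughly what we run inside the guest to verify
the topology and measure idle latency (a sketch from memory; the mlc path
and the numactl binding may need adjusting for your MLC version):

# The CXL-backed memory should show up as NUMA node 1 with no CPUs attached
numactl -H

# Idle latency from node-0 CPUs against each memory node
numactl --cpunodebind=0 --membind=0 ./mlc --idle_latency   # system DRAM, ~117ns
numactl --cpunodebind=0 --membind=1 ./mlc --idle_latency   # virtual CXL memory, ~269ns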

>
> The only thing I'm surprised by is that the NUMA node appears without
> requiring the driver to create it.  It's possible I missed a QEMU
> update that allows this.
>
> ~Gregory
