Hi Paolo,

This series has received Reviewed-by/Acked-by tags on all patches. Apart
from some coding style comments from Alexey and David's suggestion in
patch #4 to document the bitmap consistency, are there any other
comments? If not, I will send the next version to resolve them.
Thanks,
Chenyi

On 5/30/2025 4:32 PM, Chenyi Qiang wrote:
> This is the v6 series of the shared device assignment support.
>
> Compared with the last version [1], this series retains the basic support
> and removes the additional complex error handling, which can be added
> back when necessary. Meanwhile, the patchset has been re-organized to
> be clearer.
>
> Overview of this series:
>
> - Patch 1-3: Preparation patches. These include function exposure and
>   some function prototype changes.
> - Patch 4: Introduce a new object to implement the RamDiscardManager
>   interface and a helper to notify the shared/private state change.
> - Patch 5: Enable coordinated discarding of RAM with guest_memfd through
>   the RamDiscardManager interface.
>
> More small changes and details can be found in the individual patches.
>
> ---
>
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared and private at runtime.
>
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. In the current implementation, shared memory is allocated
> with normal methods (e.g. mmap or fallocate) while private memory is
> allocated from guest_memfd. When a VM performs memory conversions, QEMU
> frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd on
> one side, and allocates new pages from the other side. This causes the
> stale IOMMU mapping issue mentioned in [2] when we try to enable shared
> device assignment in confidential VMs.
>
> Solution
> ========
> The key to enabling shared device assignment is to update the IOMMU
> mappings on page conversion. RamDiscardManager, an existing interface
> currently utilized by virtio-mem, offers a means to modify IOMMU mappings
> in accordance with VM page assignment. Page conversion is similar to
> hot-removing a page in one mode and adding it back in the other.
>
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions. (A toy
> model of this flow is sketched below, after the quoted letter.)
>
> Limitation and future extension
> ===============================
> This series only supports the basic shared device assignment
> functionality. There are still some limitations and areas that can be
> extended and optimized in the future.
>
> Relationship with in-place conversion
> -------------------------------------
> In-place page conversion is the ongoing work to allow mmap() of
> guest_memfd to userspace so that both private and shared memory can use
> the same physical memory as the backend. This new design eliminates the
> need to discard pages during shared/private conversions. When it is
> ready, shared device assignment will need to be adjusted to perform an
> unmap-before-conversion-to-private and map-after-conversion-to-shared
> sequence to be compatible with the change.
>
> Partial unmap limitation
> ------------------------
> VFIO expects the DMA mapping for a specific IOVA to be mapped and
> unmapped with the same granularity. The guest may perform partial
> conversion, such as converting a small region within a larger one. To
> prevent such invalid cases, current operations are performed with 4K
> granularity. This could be optimized after the DMA mapping cut operation
> [3] is introduced in the future. We can then always perform a
> split-before-unmap if a partial conversion happens.
> If the split succeeds, the unmap will succeed and be atomic. If the
> split fails, the unmap process fails.
>
> More attributes management
> --------------------------
> The current RamDiscardManager can only manage a pair of opposite states,
> like populated/discarded or shared/private. If more states need to be
> considered, for example to support virtio-mem in confidential VMs, three
> states would be possible (shared populated/private populated/discarded).
> The current framework cannot handle such a scenario, and a new framework
> would need to be designed at that time [4].
>
> Memory overhead optimization
> ----------------------------
> A comment from Baolu [5] suggests considering a Maple Tree or a generic
> interval tree to manage the private/shared state instead of a bitmap,
> which can reduce memory consumption. This optimization can also be
> considered for other bitmap use cases, like dirty bitmaps for guest RAM.
>
> Testing
> =======
> This patch series is tested against the mainline kernel, since TDX base
> support has been merged. The QEMU repo is available at:
> https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-30-v2
>
> To facilitate shared device assignment with a NIC, employ the legacy
> type1 VFIO with the QEMU command:
>
> qemu-system-x86_64 [...]
>  -device vfio-pci,host=XX:XX.X
>
> The dma_entry_limit parameter needs to be adjusted, because every
> mapping is established at 4K granularity. For example, a 16GB guest
> needs vfio_iommu_type1.dma_entry_limit=4194304 (the arithmetic is
> worked out below, after the quoted letter).
>
> To use the iommufd-backed VFIO instead, use the QEMU command:
>
> qemu-system-x86_64 [...]
>  -object iommufd,id=iommufd0 \
>  -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>
> Because new features like the cut_mapping operation will only be
> supported in iommufd, the iommufd-backed VFIO is recommended.
>
> Related link
> ============
> [1] https://lore.kernel.org/qemu-devel/20250520102856.132417-1-chenyi.qi...@intel.com/
> [2] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonz...@redhat.com/
> [3] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_...@nvidia.com/
> [4] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090...@redhat.com/
> [5] https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23d...@linux.intel.com/
>
> Chenyi Qiang (5):
>   memory: Export a helper to get intersection of a MemoryRegionSection
>     with a given range
>   memory: Change memory_region_set_ram_discard_manager() to return the
>     result
>   memory: Unify the definition of ReplayRamPopulate() and
>     ReplayRamDiscard()
>   ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
>     with guest_memfd
>   physmem: Support coordinated discarding of RAM with guest_memfd
>
>  MAINTAINERS                   |   1 +
>  accel/kvm/kvm-all.c           |   9 +
>  hw/virtio/virtio-mem.c        |  83 +++---
>  include/system/memory.h       | 100 +++++--
>  include/system/ramblock.h     |  22 ++
>  migration/ram.c               |   5 +-
>  system/memory.c               |  22 +-
>  system/meson.build            |   1 +
>  system/physmem.c              |  18 +-
>  system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>  system/trace-events           |   3 +
>  11 files changed, 660 insertions(+), 84 deletions(-)
>  create mode 100644 system/ram-block-attributes.c
>
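P.S. For reviewers who want a quick feel for the conversion-notification
flow described in the "Solution" and "Partial unmap limitation" sections
above, here is a small standalone C model. It is only a sketch: none of
these names are the series' or QEMU's real APIs, they are made up for
illustration. In QEMU the two callbacks correspond to the
RamDiscardListener populate/discard hooks through which VFIO updates its
IOMMU mappings.

/*
 * Toy, self-contained model (illustration only; all names invented).
 * A bitmap tracks the shared/private state of each 4K page.  A
 * conversion walks the range at 4K granularity and fires a callback
 * only for pages whose state actually changes.
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_PAGE_SIZE 4096ULL
#define TOY_NR_PAGES  16U

/* 1 bit per 4K page: 1 = shared (DMA-mappable), 0 = private. */
static uint8_t shared_bitmap[(TOY_NR_PAGES + 7) / 8];

static bool page_is_shared(unsigned int i)
{
    return shared_bitmap[i / 8] & (1U << (i % 8));
}

static void page_set_shared(unsigned int i, bool shared)
{
    if (shared) {
        shared_bitmap[i / 8] |= (1U << (i % 8));
    } else {
        shared_bitmap[i / 8] &= ~(1U << (i % 8));
    }
}

/* Stand-ins for the populate/discard notifications delivered to a
 * listener such as VFIO, which would map/unmap the DMA range here. */
static void notify_to_shared(uint64_t offset, uint64_t size)
{
    printf("  DMA map:   offset=0x%04" PRIx64 " size=0x%" PRIx64 "\n",
           offset, size);
}

static void notify_to_private(uint64_t offset, uint64_t size)
{
    printf("  DMA unmap: offset=0x%04" PRIx64 " size=0x%" PRIx64 "\n",
           offset, size);
}

/* Convert [offset, offset + size) at 4K granularity, skipping pages
 * already in the requested state so a listener never sees a double
 * map/unmap for the same page. */
static void toy_state_change(uint64_t offset, uint64_t size, bool to_shared)
{
    for (uint64_t off = offset; off < offset + size; off += TOY_PAGE_SIZE) {
        unsigned int i = (unsigned int)(off / TOY_PAGE_SIZE);

        if (page_is_shared(i) == to_shared) {
            continue;           /* no state transition, no notification */
        }
        page_set_shared(i, to_shared);
        if (to_shared) {
            notify_to_shared(off, TOY_PAGE_SIZE);
        } else {
            notify_to_private(off, TOY_PAGE_SIZE);
        }
    }
}

int main(void)
{
    printf("guest converts [0x0000, 0x4000) to shared:\n");
    toy_state_change(0x0000, 0x4000, true);

    printf("guest converts [0x2000, 0x3000) back to private (partial):\n");
    toy_state_change(0x2000, 0x1000, false);
    return 0;
}

Note how the second, partial conversion only unmaps the single 4K page
that changed state; that is exactly why the series performs all mappings
at 4K granularity until a mapping-cut operation becomes available.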
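And the arithmetic behind the dma_entry_limit example above: since every
mapping is established at 4K granularity, a fully shared 16GB guest can
need up to

    16 GiB / 4 KiB = (16 * 2^30) / 2^12 = 2^22 = 4194304

concurrent type1 DMA mappings, well above the vfio_iommu_type1 default
limit of 65535 entries.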
