Hi Paolo,

This series has received Reviewed-by/Acked-by on all patches. Apart
from some coding style comments from Alexey and David's suggestion on
patch #4 to document the bitmap consistency, are there any other
comments? If not, I will send the next version to resolve them.

Thanks
Chenyi

On 5/30/2025 4:32 PM, Chenyi Qiang wrote:
> This is the v6 series of the shared device assignment support.
> 
> Compared with the last version [1], this series retains the basic
> support and removes the additional complex error handling, which can be
> added back when necessary. Meanwhile, the patchset has been reorganized
> for clarity.
> 
> Overview of this series:
> 
> - Patch 1-3: Preparation patches. These include function exposure and
>   some function prototype changes.
> - Patch 4: Introduce a new object to implement the RamDiscardManager
>   interface and a helper to notify shared/private state changes.
> - Patch 5: Enable coordinated discarding of RAM with guest_memfd through
>   the RamDiscardManager interface.
> 
> Smaller changes and details can be found in the individual patches.
> 
> ---
> 
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared and private at runtime.
> 
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. In the current implementation, shared memory is
> allocated with normal methods (e.g. mmap() or fallocate()) while
> private memory is allocated from guest_memfd. When a VM performs a
> memory conversion, QEMU frees the pages on one side via madvise() or
> PUNCH_HOLE on the memfd or guest_memfd, and allocates new pages on the
> other side. This causes the stale IOMMU mapping issue mentioned in [2]
> when we try to enable shared device assignment in confidential VMs.
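> 
> As an illustration, here is a minimal standalone sketch of the discard
> step (the helper name is hypothetical; QEMU performs this through its
> existing RAM discard helpers):
> 
> #define _GNU_SOURCE
> #include <fcntl.h>          /* fallocate() */
> #include <linux/falloc.h>   /* FALLOC_FL_* */
> #include <sys/mman.h>       /* madvise() */
> #include <sys/types.h>
> #include <stddef.h>
> 
> /* Free the old backing pages of a range that is being converted; the
>  * new pages are then allocated from the other backend on next access. */
> static int discard_range(void *host_addr, int fd, off_t offset, size_t size)
> {
>     if (fd >= 0) {
>         /* fd-backed memory (memfd or guest_memfd): punch a hole */
>         return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                          offset, size);
>     }
>     /* anonymous memory: drop the page range */
>     return madvise(host_addr, size, MADV_DONTNEED);
> }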
> 
> Solution
> ========
> The key to enabling shared device assignment is to update the IOMMU
> mappings on page conversion. RamDiscardManager, an existing interface
> currently utilized by virtio-mem, offers a means to modify IOMMU
> mappings in accordance with VM page assignment. A page conversion is
> similar to hot-removing a page in one mode and adding it back in the
> other.
> 
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions, as
> sketched below.
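> 
> For example, a simplified sketch of the notification path (types and
> names below are stand-ins; the real series implements this against
> QEMU's RamDiscardManager/RamDiscardListener interfaces in
> system/ram-block-attributes.c):
> 
> #include <stdbool.h>
> 
> typedef struct Section { unsigned long offset, size; } Section;
> 
> typedef struct Listener {
>     /* VFIO registers these: populate -> DMA map, discard -> DMA unmap */
>     int  (*notify_populate)(struct Listener *l, const Section *s);
>     void (*notify_discard)(struct Listener *l, const Section *s);
>     struct Listener *next;
> } Listener;
> 
> /* Called on a guest page conversion: to_shared makes the range
>  * DMA-able again, !to_shared hides it from the device. */
> static int notify_conversion(Listener *head, const Section *s,
>                              bool to_shared)
> {
>     for (Listener *l = head; l; l = l->next) {
>         if (to_shared) {
>             int ret = l->notify_populate(l, s);  /* VFIO maps the range */
>             if (ret) {
>                 return ret;
>             }
>         } else {
>             l->notify_discard(l, s);             /* VFIO unmaps the range */
>         }
>     }
>     return 0;
> }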
> 
> Limitation and future extension
> ===============================
> This series only supports the basic shared device assignment functionality.
> There are still some limitations and areas that can be extended and
> optimized in the future.
> 
> Relationship with in-place conversion
> -------------------------------------
> In-place page conversion is the ongoing work to allow mmap() of
> guest_memfd to userspace so that both private and shared memory can use
> the same physical memory as the backend. This new design eliminates the
> need to discard pages during shared/private conversions. When it is
> ready, shared device assignment needs to be adjusted to follow an
> unmap-before-conversion-to-private and map-after-conversion-to-shared
> sequence to stay compatible with the change, as sketched below.
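> 
> A sketch of that ordering (all helper names below are illustrative
> placeholders rather than real QEMU/VFIO API):
> 
> #include <stdbool.h>
> 
> typedef struct Range { unsigned long start, size; } Range;
> 
> /* Stubs standing in for the real DMA map/unmap and attribute calls. */
> static int dma_unmap_range(const Range *r) { (void)r; return 0; }
> static int dma_map_range(const Range *r)   { (void)r; return 0; }
> static int set_private(const Range *r, bool priv)
> {
>     (void)r; (void)priv; return 0;
> }
> 
> static int convert_in_place(const Range *r, bool to_private)
> {
>     int ret;
> 
>     if (to_private) {
>         /* unmap-before-conversion-to-private: stop device DMA to the
>          * range before it becomes inaccessible to the host */
>         ret = dma_unmap_range(r);
>         if (ret) {
>             return ret;
>         }
>         return set_private(r, true);
>     }
>     /* map-after-conversion-to-shared: only re-expose the range to the
>      * device once it is shared again */
>     ret = set_private(r, false);
>     if (ret) {
>         return ret;
>     }
>     return dma_map_range(r);
> }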
> 
> Partial unmap limitation
> ------------------------
> VFIO expects the DMA mapping for a specific IOVA to be mapped and
> unmapped with the same granularity. The guest may perform a partial
> conversion, such as converting a small region within a larger one. To
> prevent such invalid cases, current operations are performed at 4K
> granularity. This could be optimized once the DMA mapping cut operation
> [3] is introduced in the future: we could then always perform a
> split-before-unmap when a partial conversion happens. If the split
> succeeds, the unmap will succeed and be atomic; if the split fails, the
> whole unmap fails.
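> 
> In pseudo-C, the intended flow would look like the following (the
> helper names are hypothetical; the cut operation is the one proposed
> in [3]):
> 
> typedef struct Map { unsigned long iova, size; } Map;
> 
> /* Stubs for the future IOMMU cut operation and the VFIO unmap. */
> static int iommu_cut_mapping(const Map *mapped, const Map *part)
> {
>     (void)mapped; (void)part; return 0;
> }
> static int dma_unmap(const Map *part) { (void)part; return 0; }
> 
> static int unmap_converted_part(const Map *mapped, const Map *part)
> {
>     if (part->iova != mapped->iova || part->size != mapped->size) {
>         /* Partial conversion: split the existing mapping at the
>          * boundaries of 'part' before unmapping it. */
>         int ret = iommu_cut_mapping(mapped, part);
>         if (ret) {
>             return ret;  /* split failed => the whole unmap fails */
>         }
>     }
>     /* The unmap now matches an exact mapping and is atomic. */
>     return dma_unmap(part);
> }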
> 
> More attributes management
> --------------------------
> The current RamDiscardManager can only manage a pair of opposite states,
> such as populated/discarded or shared/private. If more states need to be
> considered, for example to support virtio-mem in confidential VMs, three
> states would be possible (shared populated/private populated/discarded),
> as illustrated below. The current framework cannot handle such a
> scenario, and a new framework will need to be designed at that time [4].
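> 
> As a small illustration, the per-page state would then need three
> values instead of one bit (names are illustrative only):
> 
> /* One bit per page suffices for shared/private today; with virtio-mem
>  * in confidential VMs, each page would need one of three states: */
> typedef enum {
>     PAGE_SHARED_POPULATED,
>     PAGE_PRIVATE_POPULATED,
>     PAGE_DISCARDED,
> } PageState;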
> 
> Memory overhead optimization
> ----------------------------
> A comment from Baolu [5] suggests considering a Maple Tree or a generic
> interval tree to manage the private/shared state instead of a bitmap,
> which can reduce memory consumption. This optimization can also be
> considered for other bitmap use cases such as dirty bitmaps for guest
> RAM. A sketch of the idea follows.
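> 
> A minimal sketch of the range-based idea (illustrative only; a Maple
> Tree or an rbtree-based interval tree would replace the linked list in
> a real implementation):
> 
> #include <stdbool.h>
> 
> /* Track only the shared ranges, kept sorted and non-overlapping, so
>  * memory cost scales with the number of conversions rather than with
>  * the guest memory size. */
> typedef struct SharedRange {
>     unsigned long start, end;   /* [start, end) offsets in the RAMBlock */
>     struct SharedRange *next;
> } SharedRange;
> 
> static bool range_is_shared(const SharedRange *head,
>                             unsigned long start, unsigned long end)
> {
>     for (const SharedRange *r = head; r && r->start < end; r = r->next) {
>         if (r->start <= start && end <= r->end) {
>             return true;        /* fully contained in one shared range */
>         }
>     }
>     return false;
> }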
> 
> Testing
> =======
> This patch series is tested on the mainline kernel, since TDX base
> support has been merged. The QEMU repo is available at:
> https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-30-v2
> 
> To enable shared device assignment with a NIC, use the legacy type1
> VFIO backend with the QEMU command:
> 
> qemu-system-x86_64 [...]
>     -device vfio-pci,host=XX:XX.X
> 
> The dma_entry_limit parameter of the vfio_iommu_type1 module needs to
> be raised, since mappings are maintained at 4K granularity. For example,
> a 16GB guest needs 16GB / 4K = 4194304 entries:
> vfio_iommu_type1.dma_entry_limit=4194304.
> 
> Because new features such as the cut_mapping operation will only be
> supported in iommufd, using the iommufd-backed VFIO is recommended:
> 
> qemu-system-x86_64 [...]
>     -object iommufd,id=iommufd0 \
>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
> 
> Related link
> ============
> [1] 
> https://lore.kernel.org/qemu-devel/20250520102856.132417-1-chenyi.qi...@intel.com/
> [2] 
> https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonz...@redhat.com/
> [3] 
> https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_...@nvidia.com/
> [4] 
> https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090...@redhat.com/
> [5] 
> https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23d...@linux.intel.com/
> 
> Chenyi Qiang (5):
>   memory: Export a helper to get intersection of a MemoryRegionSection
>     with a given range
>   memory: Change memory_region_set_ram_discard_manager() to return the
>     result
>   memory: Unify the definition of ReplayRamPopulate() and
>     ReplayRamDiscard()
>   ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
>     with guest_memfd
>   physmem: Support coordinated discarding of RAM with guest_memfd
> 
>  MAINTAINERS                   |   1 +
>  accel/kvm/kvm-all.c           |   9 +
>  hw/virtio/virtio-mem.c        |  83 +++---
>  include/system/memory.h       | 100 +++++--
>  include/system/ramblock.h     |  22 ++
>  migration/ram.c               |   5 +-
>  system/memory.c               |  22 +-
>  system/meson.build            |   1 +
>  system/physmem.c              |  18 +-
>  system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
>  system/trace-events           |   3 +
>  11 files changed, 660 insertions(+), 84 deletions(-)
>  create mode 100644 system/ram-block-attributes.c
> 

