On 07/01/2025 06:55, Zhangfei Gao wrote:
> Hi, Joao
> 
> On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.mart...@oracle.com> 
> wrote:
>>
>> Hey,
>>
>> This series introduces support for vIOMMU with VFIO device migration,
>> particularly related to how we do the dirty page tracking.
>>
>> Today vIOMMUs serve two purposes: 1) enabling interrupt remapping and
>> 2) providing DMA translation services so guests can do some form of
>> guest-kernel-managed DMA, e.g. for nested-virt-based usage; (1) is
>> especially required for big VMs with VFs and more than 255 vcpus. We
>> tackle both and remove the migration blocker when a vIOMMU is present,
>> provided the conditions are met. I have both use-cases here in one
>> series, but I am happy to tackle them in separate series.
>>
>> As I found out, we don't necessarily need to expose the whole vIOMMU
>> functionality in order to just support interrupt remapping. x86 IOMMUs
>> on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
>> Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
>> can instantiate an IOMMU just for interrupt remapping, without needing
>> to advertise/support DMA translation. The AMD IOMMU can in theory
>> provide the same, but Linux doesn't quite support the IR-only part
>> there yet; only intel-iommu does.
>>
>> The series is organized as follows:
>>
>> Patches 1-5: Today we can't gather vIOMMU details before the guest
>> establishes its first DMA mapping via the vIOMMU. So these first
>> patches add a way for vIOMMUs to be asked about their properties at
>> start of day. I chose the approach with the least churn for now (as
>> opposed to a treewide conversion), while allowing easy conversion a
>> posteriori. As suggested by Peter Xu[7], I have resurrected Yi's
>> patches[5][6], which allow us to fetch the PCI-backing vIOMMU
>> attributes without necessarily tying the caller (VFIO or anyone else)
>> to an IOMMU MR like I was doing in v3.
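
Roughly, the consumer side of that attribute fetching would look
something like the sketch below; the helper and attribute names are my
shorthand for the approach and may not match the final patches:

    /* Sketch only: ask the backing vIOMMU for a property at start of
     * day, before the guest establishes any DMA mapping. The helper
     * name/signature is illustrative, not necessarily the series' API. */
    static bool viommu_has_dma_translation(PCIDevice *pdev)
    {
        bool xlat = true;

        if (pci_device_get_iommu_attr(pdev, IOMMU_ATTR_DMA_TRANSLATION,
                                      &xlat)) {
            /* Attribute not implemented by this vIOMMU: assume the
             * conservative default, i.e. DMA translation enabled. */
            return true;
        }
        return xlat;
    }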
>>
>> Patches 6-8: Handle configs with vIOMMU interrupt remapping but
>> without DMA translation allowed. Today the 'dma-translation' attribute
>> is x86-iommu only, but the way this series is structured, nothing
>> stops other vIOMMUs from supporting it too, as long as they use
>> pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
>> are handled. The blocker is thus relaxed when vIOMMUs are able to
>> toggle/report the DMA_TRANSLATION attribute. With the patches up to
>> this point, we've tackled item (1) of the second paragraph.
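
In effect the blocker logic becomes something like this (again a
sketch, reusing the hypothetical helper above; migrate_add_blocker() is
the real QEMU API, error handling simplified):

    /* Sketch: keep the migration blocker only when the vIOMMU actually
     * translates DMA; an IR-only vIOMMU no longer blocks migration. */
    static int vfio_viommu_check_migration(PCIDevice *pdev, Error **errp)
    {
        Error *blocker = NULL;

        if (!viommu_has_dma_translation(pdev)) {
            return 0;   /* IR-only: nothing to dirty-track, no blocker */
        }
        error_setg(&blocker,
                   "VFIO migration with vIOMMU DMA translation enabled");
        return migrate_add_blocker(blocker, errp);
    }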
> 
> I don't understand how the device page table is handled.
> 
> Does this mean that, after live migration, the page table built by the
> vIOMMU will be rebuilt in the target guest via pci_setup_iommu_ops?

AFAIU it is supposed to happen after loading the vIOMMU vmstate, when
the vIOMMU-related MRs are enabled. When walking the different
'emulated' address spaces, it replays all mappings (and skips
non-present parts of the address space).
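
Condensed, the VFIO side of that replay looks roughly like this
(paraphrasing the IOMMU-region branch of vfio_listener_region_add(),
with error handling and the iommu_end/iommu_idx computations elided):

    /* When a vIOMMU MR shows up in the address space VFIO listens on,
     * register a notifier for map/unmap events and ask the vIOMMU to
     * replay its present mappings into that notifier. */
    giommu->iommu_mr = IOMMU_MEMORY_REGION(section->mr);
    iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
                        IOMMU_NOTIFIER_IOTLB_EVENTS,
                        section->offset_within_region,
                        iommu_end, iommu_idx);
    memory_region_register_iommu_notifier(section->mr, &giommu->n, &err);
    memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);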

The trick in making this work largely depends on the individual vIOMMU
implementation (and this emulated vIOMMU stuff shouldn't be confused
with IOMMU nesting, btw!). In the Intel case (and AMD will be similar),
the root table pointer that's part of the vmstate has all the device
pagetables, which are just guest memory that gets migrated over, and
that is enough to resolve VT-d/IVRS page walks.

The somewhat hard-to-follow part is that, when it replays, it walks the
whole DMAR memory region and only notifies IOMMU MR listeners if
there's a present PTE, skipping it otherwise. So by the end of enabling
the MRs the IOTLB gets reconstructed. You would have to study the flow
with the particular vIOMMU you are using, though.
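
To illustrate the "only notify present PTEs" part, a very stripped-down
version of that walk (just the shape of vtd_page_walk_level(), not the
real code) would be:

    #include <stdint.h>

    #define PTE_PRESENT (1ULL << 0)  /* simplified; VT-d checks R/W bits */
    #define PTE_LARGE   (1ULL << 7)

    /* 4K pages, 9 bits of IOVA per pagetable level. */
    static int level_shift(int level) { return 12 + 9 * (level - 1); }

    /* The real code reads the next-level table from guest memory; here
     * we pretend PTEs carry host pointers to keep the sketch
     * self-contained. */
    static uint64_t *next_table(uint64_t pte)
    {
        return (uint64_t *)(uintptr_t)(pte & ~0xfffULL);
    }

    static void walk_level(uint64_t *table, int level, uint64_t base,
                           void (*notify)(uint64_t iova, uint64_t pte))
    {
        for (int i = 0; i < 512; i++) {
            uint64_t pte = table[i];
            uint64_t iova = base + ((uint64_t)i << level_shift(level));

            if (!(pte & PTE_PRESENT)) {
                continue;              /* non-present: no notification */
            }
            if (level == 1 || (pte & PTE_LARGE)) {
                notify(iova, pte);     /* present leaf: notify listener */
            } else {
                walk_level(next_table(pte), level - 1, iova, notify);
            }
        }
    }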

The replay in intel-iommu is triggered by more or less this stack trace
for a present PTE:

vfio_iommu_map_notify
memory_region_notify_iommu_one
vtd_replay_hook
vtd_page_walk_one
vtd_page_walk_level
vtd_page_walk_level
vtd_page_walk_level
vtd_page_walk
vtd_iommu_replay
memory_region_iommu_replay
vfio_listener_region_add
address_space_update_topology_pass
address_space_set_flatview
memory_region_transaction_commit
vtd_switch_address_space
vtd_switch_address_space_all
vtd_post_load
vmstate_load_state
vmstate_load
qemu_loadvm_section_start_full
qemu_loadvm_state_main
qemu_loadvm_state
process_incoming_migration_co
