Cédric Le Goater <c...@redhat.com> writes:

> On 2/27/25 23:01, Maciej S. Szmigiero wrote:
>> On 27.02.2025 07:59, Cédric Le Goater wrote:
>>> On 2/19/25 21:34, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>
>>>>
>>>> Update the VFIO documentation at docs/devel/migration describing the
>>>> changes brought by the multifd device state transfer.
>>>>
>>>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
>>>> ---
>>>>   docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
>>>>   1 file changed, 71 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>>>> index c49482eab66d..d9b169d29921 100644
>>>> --- a/docs/devel/migration/vfio.rst
>>>> +++ b/docs/devel/migration/vfio.rst
>>>> @@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>>>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>
>>> Please add a new "multifd" documentation subsection at the end of the file
>>> with this part :
>>>
>>>> +Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
>>>> +_STOP_COPY state via multifd channels. This helps reduce downtime - especially
>>>> +with multiple VFIO devices or with devices having a large migration state.
>>>> +As an additional benefit, setting the VFIO device to _STOP_COPY state and
>>>> +saving its config space is also parallelized (run in a separate thread) in
>>>> +such migration mode.
>>>> +
>>>> +The multifd VFIO device state transfer is controlled by
>>>> +"x-migration-multifd-transfer" VFIO device property. This property defaults to
>>>> +AUTO, which means that VFIO device state transfer via multifd channels is
>>>> +attempted in configurations that otherwise support it.
>>>> +
>>
>> Done - I also moved the parts about x-migration-max-queued-buffers
>> and x-migration-load-config-after-iter description there since
>> obviously they wouldn't make sense being left alone in the top section.
>>
>>> I was expecting a much more detailed explanation on the design too :
>>>
>>>  * in the cover letter
>>>  * in the hw/vfio/migration-multifd.c
>>>  * in some new file under docs/devel/migration/
>
> I forgot to add:
>
>  * a guide on how to use this new feature from QEMU and libvirt,
>    something we can refer to for tests. That's a must-have.
>  * usage scenarios
>    There are some benefits, but it is not obvious a user would
>    like to use multiple VFs in one VM, please explain.
>    This is a major addition which needs justification anyhow.
>  * pros and cons
>
>> I'm not sure what descriptions you exactly want in these places,
>
> Looking from the VFIO subsystem, the way this series works is very opaque.
> There are a couple of new migration handlers, new threads, new channels,
> etc. It has been discussed several times with migration folks, please
> provide a summary for a new reader as ignorant as everyone would be when
> looking at a new file.
>
>> but since that's just documentation (not code) it could be added after
>> the code freeze...
>
> That's the risk of not getting any! And the initial proposal should be
> discussed before code freeze.
>
> For the general framework, I was expecting an extension of a "multifd"
> subsection under:
>
>   https://qemu.readthedocs.io/en/v9.2.0/devel/migration/features.html
>
> but it doesn't exist :/
Hi, see if this helps. Let me know what can be improved and if something
needs to be more detailed. Please ignore the formatting, I'll send a
proper patch after the carnival.

@Maciej, it's probably better if you keep your docs separate anyway so we
don't add another dependency. I can merge them later.

multifd.rst:

Multifd
=======

Multifd is the name given to the migration capability that enables data
transfer using multiple threads. Multifd supports all the transport types
currently in use with migration (inet, unix, vsock, fd, file).

Restrictions
------------

For migration to a file, support is conditional on the presence of the
mapped-ram capability, see #mapped-ram.

Snapshots are currently not supported.

Postcopy migration is currently not supported.

Usage
-----

On both source and destination, enable the ``multifd`` capability:

    ``migrate_set_capability multifd on``

Define a number of channels to use (the default is 2, but 8 usually
provides the best performance):

    ``migrate_set_parameter multifd-channels 8``

Components
----------

Multifd consists of:

- A client that produces the data on the migration source side and
  consumes it on the destination. Currently the main client code is
  ram.c, which selects the RAM pages for migration;

- A shared data structure (MultiFDSendData), used to transfer data
  between multifd and the client. On the source side, this structure is
  further subdivided into payload types (MultiFDPayload);

- An API operating on the shared data structure to allow the client code
  to interact with multifd:

  - multifd_send/recv(): A dispatcher that transfers work to/from the
    channels.

  - multifd_*payload_* and MultiFDPayloadType: Support defining an opaque
    payload. The payload is always wrapped by MultiFDSend|RecvData.

  - multifd_send_data_*: Used to manage the memory for the shared data
    structure.

- The threads that process the data (aka channels, due to a 1:1 mapping
  to QIOChannels). Each multifd channel supports callbacks that can be
  used for fine-grained processing of the payload, such as compression
  and zero page detection.

- A packet, which is the final result of all the data aggregation and/or
  transformation. The packet contains a header, a payload-specific header
  and a variable-size data portion.

  - The packet header: contains a magic number, a version number and
    flags that inform of special processing needed on the destination.

  - The payload-specific header: contains metadata referring to the
    packet's data portion, such as page counts.

  - The data portion: contains the actual opaque payload data.

  Note that due to historical reasons, the terminology around multifd
  packets is inconsistent.

  The mapped-ram feature ignores packets entirely.

Theory of operation
-------------------

The multifd channels operate in parallel with the main migration thread.
The transfer of data from the client code into multifd happens from the
main migration thread using the multifd API.

The interaction between the client code and the multifd channels happens
in the multifd_send() and multifd_recv() methods. These are responsible
for selecting the next idle channel and making the shared data structure
containing the payload accessible to that channel. The client code
receives back an empty object which it then uses for the next iteration
of data transfer.

The selection of idle channels is simply a round-robin over the idle
channels (!p->pending_job). Channels wait at a semaphore; once a channel
is released, it starts operating on the data immediately. A standalone
model of this dispatch pattern is sketched below.
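The following sketch models the producer/channel pointer swap with plain
pthreads. This is not QEMU code: Channel, Data and dispatch() are invented
for illustration, and the real implementation (multifd_send() in
migration/multifd.c) differs in detail, e.g. it sleeps on a "channels
ready" semaphore instead of spinning when all channels are busy::

  #include <pthread.h>
  #include <semaphore.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define NUM_CHANNELS 2

  typedef struct {
      int payload;              /* stands in for MultiFDSendData */
  } Data;

  typedef struct {
      sem_t sem;                /* channel sleeps here when idle */
      bool pending_job;         /* set by producer, cleared by channel */
      Data *data;               /* channel-owned buffer, swapped on dispatch */
      pthread_t thread;
  } Channel;

  static Channel channels[NUM_CHANNELS];

  /* Model of multifd_send(): find an idle channel, swap buffers, wake it. */
  static void dispatch(Data **client_data)
  {
      for (;;) {
          for (int i = 0; i < NUM_CHANNELS; i++) {
              Channel *c = &channels[i];
              Data *tmp;

              if (__atomic_load_n(&c->pending_job, __ATOMIC_ACQUIRE)) {
                  continue;     /* still busy with the previous payload */
              }
              tmp = *client_data;       /* give the filled buffer away... */
              *client_data = c->data;   /* ...and take back an empty one */
              c->data = tmp;
              __atomic_store_n(&c->pending_job, true, __ATOMIC_RELEASE);
              sem_post(&c->sem);        /* release the channel */
              return;
          }
      }
  }

  static void *channel_thread(void *opaque)
  {
      Channel *c = opaque;

      for (;;) {
          sem_wait(&c->sem);    /* idle until released by dispatch() */
          printf("channel %ld: sent payload %d\n",
                 (long)(c - channels), c->data->payload);  /* "transmit" */
          __atomic_store_n(&c->pending_job, false, __ATOMIC_RELEASE);
      }
      return NULL;
  }

  int main(void)
  {
      Data *client_data = calloc(1, sizeof(Data));

      for (int i = 0; i < NUM_CHANNELS; i++) {
          channels[i].data = calloc(1, sizeof(Data));
          sem_init(&channels[i].sem, 0, 0);
          pthread_create(&channels[i].thread, NULL, channel_thread,
                         &channels[i]);
      }
      for (int n = 0; n < 8; n++) {     /* producer: fill, then dispatch */
          client_data->payload = n;
          dispatch(&client_data);
      }
      sleep(1);                         /* let the channels drain */
      return 0;
  }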
Aside from eventually transmitting the data over the underlying
QIOChannel, a channel's operation also includes calling back to the
client code at pre-determined points to allow for client-specific
handling such as data transformation (e.g. compression), creation of the
packet header and arranging the data into iovs (struct iovec). Iovs are
the type of data on which the QIOChannel operates.

Client code (migration thread):

1. Populate the shared structure with opaque data (ram pages, device state)
2. Call multifd_send()
   2a. Loop over the channels until one is idle
   2b. Switch pointers between client data and channel data
   2c. Release the channel semaphore
3. Receive back an empty object
4. Repeat

Multifd channel (multifd thread):

1. Channel idle
2. Gets released by multifd_send()
3. Call multifd_ops methods to fill the iov
   3a. Compression may happen
   3b. Zero page detection may happen
   3c. Packet is written
   3d. iov is written
4. Pass the iov into the QIOChannel for transfer
5. Repeat

The destination side operates similarly, but with multifd_recv(),
decompression instead of compression, etc. One important aspect is that,
when receiving the data, the iov will contain host virtual addresses, so
guest memory is written to directly from the multifd threads.

About flags
-----------

The main thread orchestrates the migration by issuing control flags on
the migration stream (QEMU_VM_*).

The main memory is migrated by ram.c and includes specific control flags
that are also put on the main migration stream (RAM_SAVE_FLAG_*).

Multifd has its own set of MULTIFD_FLAGs that are included in each
packet. These may inform about properties such as the compression
algorithm used if the data is compressed.

Synchronization
---------------

Since the migration process is iterative due to RAM dirty tracking, it is
necessary to invalidate data that is no longer current (e.g. due to the
source VM touching the page). This is done by having a synchronization
point triggered by the migration thread at key points during the
migration. Data that is received after the synchronization point is
allowed to overwrite data received prior to that point.

To perform the synchronization, multifd provides the
multifd_send_sync_main() and multifd_recv_sync_main() helpers. These are
called whenever the client code wishes to ensure that all data sent
previously has been received by the destination. The synchronization
process involves flushing the remaining client data still left to be
transmitted and issuing a multifd packet containing the
MULTIFD_FLAG_SYNC flag. This flag informs the receiving end that it
should finish reading the data and wait for a synchronization point.

To complete the sync, the main migration stream issues a
RAM_SAVE_FLAG_MULTIFD_FLUSH flag. When that flag is received by the
destination, it ensures all of its channels have seen the
MULTIFD_FLAG_SYNC and moves them to an idle state. The client code can
then continue with a second round of data by issuing multifd_send() once
again. This sequence is sketched below.

The synchronization process also ensures that internal synchronization
happens, i.e. between the threads themselves. This is necessary to avoid
some threads lagging behind in sending or receiving as the migration
approaches completion.

The mapped-ram feature has different synchronization requirements because
it is an asynchronous migration, i.e. the source and the destination are
not migrating at the same time. For that feature, only the internal sync
is relevant.
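To summarize the handshake, here is pseudocode of one sync point, written
in C style but using only the names introduced above. The real signatures
differ (see migration/multifd.c and migration/ram.c), and
dirty_pages_left(), fill_payload() and main_stream are hypothetical
placeholders::

  /* Source side, main migration thread (pseudocode, simplified): */

  while (dirty_pages_left()) {          /* hypothetical condition */
      fill_payload(send_data);          /* hypothetical helper */
      multifd_send(&send_data);         /* hand the payload to a channel */
  }

  multifd_send_sync_main();             /* flush what is left and put a
                                         * MULTIFD_FLAG_SYNC packet on
                                         * every channel */

  qemu_put_be64(main_stream, RAM_SAVE_FLAG_MULTIFD_FLUSH);

  /* Destination side, main migration thread, upon reading
   * RAM_SAVE_FLAG_MULTIFD_FLUSH from the main stream: */

  multifd_recv_sync_main();             /* wait until every channel has
                                         * seen MULTIFD_FLAG_SYNC, then
                                         * move them all back to idle */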
Data transformation
-------------------

Each multifd channel executes a set of callbacks before transmitting the
data. These callbacks allow the client code to alter the data format
right before sending and right after receiving.

Since the object of the RAM migration is always the memory page, and the
only non-compression processing done for memory pages is zero page
detection (which can be seen as a form of compression), the multifd_ops
functions are divided into two mutually exclusive groups: compression and
no-compression.

The migration without compression (i.e. the regular ram migration) has
one further specificity, as mentioned above: it may do zero page
detection (see the zero-page-detection migration parameter). This
consists of sending all pages to multifd and letting the detection of a
zero page happen in the multifd channels, instead of doing it beforehand
on the main migration thread as was done in the past.

Code structure
--------------

The multifd code is divided into:

The main file containing the core routines

- multifd.c

RAM migration

- multifd-nocomp.c (nocomp, for "no compression")
- multifd-zero-page.c
- ram.c (also involved in non-multifd migrations + snapshots)

Compressors

- multifd-uadk.c
- multifd-qatzip.c
- multifd-zlib.c
- multifd-qpl.c
- multifd-zstd.c
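For orientation, here is an illustrative sketch of the multifd_ops
callback table that each of the files above plugs into. It approximates
the MultiFDMethods structure in migration/multifd.h; the names and
signatures are simplified, not the exact QEMU definitions::

  typedef struct MultiFDSendParams MultiFDSendParams;  /* opaque here */
  typedef struct MultiFDRecvParams MultiFDRecvParams;

  typedef struct {
      /* source side: transform the payload and fill the iov */
      int (*send_prepare)(MultiFDSendParams *p);
      /* destination side: read the iov and undo the transformation */
      int (*recv)(MultiFDRecvParams *p);
      /* ... plus setup/cleanup hooks for both sides ... */
  } Methods;

  /* Each compressor file (and multifd-nocomp.c) provides its own table
   * and registers it for the method it implements, e.g.: */
  int nocomp_send_prepare(MultiFDSendParams *p);
  int nocomp_recv(MultiFDRecvParams *p);

  static const Methods nocomp_methods = {
      .send_prepare = nocomp_send_prepare,
      .recv         = nocomp_recv,
  };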