Pre-copy support allows the VFIO device data to be transferred while the VM is running. This helps to accommodate VFIO devices that have a large amount of data that needs to be transferred, and it can reduce migration downtime.
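For context, whether a device can use the pre-copy phase at all is discovered from the
migration feature flags it reports. The snippet below is only an illustrative sketch and
is not part of this patch: it assumes the v2 migration UAPI from <linux/vfio.h>
(VFIO_DEVICE_FEATURE_MIGRATION, VFIO_MIGRATION_PRE_COPY), and "device_fd" together with
the helper name are hypothetical, standing in for an already-open VFIO device file
descriptor:

    /*
     * Illustrative sketch only: query the migration feature flags of an
     * already-open VFIO device fd and check the optional pre-copy flag.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static bool vfio_device_supports_precopy(int device_fd)
    {
        /* vfio_device_feature is a header followed by the feature payload */
        uint64_t buf[(sizeof(struct vfio_device_feature) +
                      sizeof(struct vfio_device_feature_migration) +
                      sizeof(uint64_t) - 1) / sizeof(uint64_t)] = {};
        struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
        struct vfio_device_feature_migration *mig =
            (struct vfio_device_feature_migration *)feature->data;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

        if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return false;   /* no v2 migration support at all */
        }

        /* Pre-copy is optional; a device may report only STOP_COPY */
        return mig->flags & VFIO_MIGRATION_PRE_COPY;
    }

In the patch below, vfio_migration_init() stores these reported flags in
migration->mig_flags, and the new pre-copy paths test it for VFIO_MIGRATION_PRE_COPY.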
Pre-copy support is optional in VFIO migration protocol v2. Implement
pre-copy of VFIO migration protocol v2 and use it for devices that
support it. Full description of it can be found here [1].

[1] https://lore.kernel.org/kvm/20221206083438.37807-3-yish...@nvidia.com/

Signed-off-by: Avihai Horon <avih...@nvidia.com>
---
 docs/devel/vfio-migration.rst |  29 ++++++---
 include/hw/vfio/vfio-common.h |   3 +
 hw/vfio/common.c              |   8 ++-
 hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   5 +-
 5 files changed, 140 insertions(+), 17 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1d50c2fe5f..51f5e1a537 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices can choose to skip
+the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
+flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 A detailed description of the UAPI for VFIO device migration can be found in
 the comment for the ``vfio_device_mig_state`` structure in the header file
@@ -29,6 +31,12 @@ VFIO implements the device hooks for the iterative approach as follows:
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only if the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during iterative phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -91,8 +99,10 @@ Flow of state changes during Live migration
 ===========================================
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -104,11 +114,12 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
+                 (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
              If device is active, get pending_bytes by .save_live_pending()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5f8e7a02fe..88c2194fb9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -67,7 +67,10 @@ typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t mig_flags;
     uint64_t stop_copy_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9a0dbee6b4..93b18c5e3d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 
             if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
                 (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
                 return false;
             }
         }
@@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
             }
 
             if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 760f667e04..2a0a663023 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -69,6 +69,10 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "RESUMING";
     case VFIO_DEVICE_STATE_RUNNING_P2P:
         return "RUNNING_P2P";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
+    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
+        return "PRE_COPY_P2P";
     default:
         return "UNKNOWN STATE";
     }
@@ -237,6 +241,11 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 1;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -260,6 +269,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
+    int ret;
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
@@ -273,6 +283,23 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -287,6 +314,12 @@ static void vfio_save_cleanup(void *opaque)
 
     g_free(migration->data_buffer);
     migration->data_buffer = NULL;
+
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+    }
+
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
@@ -301,9 +334,55 @@ static void vfio_save_pending(void *opaque, uint64_t threshold_size,
 
     *res_precopy_only += migration->stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+        migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
+        if (migration->precopy_init_size) {
+            /*
+             * Initial size should be transferred during pre-copy phase so
+             * stop-copy phase will not be slowed down. Report threshold_size
+             * to force another pre-copy iteration.
+             */
+            *res_precopy_only += threshold_size;
+        } else {
+            *res_precopy_only += migration->precopy_dirty_size;
+        }
+    }
+
     trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
                             *res_postcopy_only, *res_compatible,
-                            migration->stop_copy_size);
+                            migration->stop_copy_size,
+                            migration->precopy_init_size,
+                            migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_save_block(f, migration);
+    if (ret < 0) {
+        return ret;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -312,7 +391,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     enum vfio_device_mig_state recover_state;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     recover_state = VFIO_DEVICE_STATE_STOP;
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    recover_state);
@@ -430,6 +509,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -442,13 +523,19 @@
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+              migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+            VFIO_DEVICE_STATE_STOP_COPY :
+            VFIO_DEVICE_STATE_STOP;
     }
 
     ret = vfio_migration_set_state(vbasedev, new_state,
@@ -496,6 +583,9 @@ static int vfio_migration_data_notifier(NotifierWithReturn *n, void *data)
 {
     VFIOMigration *migration = container_of(n, VFIOMigration, migration_data);
     VFIODevice *vbasedev = migration->vbasedev;
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
     PrecopyNotifyData *pnd = data;
 
     if (pnd->reason != PRECOPY_NOTIFY_AFTER_BITMAP_SYNC) {
@@ -515,8 +605,21 @@ static int vfio_migration_data_notifier(NotifierWithReturn *n, void *data)
         migration->stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
     }
 
+    if ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+         migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
+        if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+            migration->precopy_init_size = 0;
+            migration->precopy_dirty_size = 0;
+        } else {
+            migration->precopy_init_size = precopy.initial_bytes;
+            migration->precopy_dirty_size = precopy.dirty_bytes;
+        }
+    }
+
     trace_vfio_migration_data_notifier(vbasedev->name,
-                                       migration->stop_copy_size);
+                                       migration->stop_copy_size,
+                                       migration->precopy_init_size,
+                                       migration->precopy_dirty_size);
 
     return 0;
 }
@@ -588,6 +691,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index db9cb94952..37724579e3 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,7 +154,7 @@ vfio_load_cleanup(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
-vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size) " (%s) stopcopy size 0x%"PRIx64
+vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
@@ -162,6 +162,7 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
-vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_save_iterate(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.26.3