On 8/8/23 08:23, Avihai Horon wrote:
On 07/08/2023 18:53, Cédric Le Goater wrote:
[ Adding Juan and Peter for their awareness ]
On 8/2/23 10:14, Avihai Horon wrote:
Changing the device state from STOP_COPY to STOP can take time as the
device may need to free resources and do other operations as part of the
transition. Currently, this is done in vfio_save_complete_precopy() and
therefore it is counted in the migration downtime.
To avoid this, change the device state from STOP_COPY to STOP in
vfio_save_cleanup(), which is called after migration has completed and
thus is not part of migration downtime.
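The effect on downtime accounting can be sketched with a toy model. This is not QEMU code: the `Toy*` names and the millisecond costs are hypothetical, chosen only to show that moving the STOP_COPY to STOP transition out of the complete-precopy step removes its cost from the downtime window.

```c
/* Hypothetical costs in milliseconds; illustrative, not measured values. */
enum { TOY_SEND_FINAL_DATA_MS = 30, TOY_STOP_COPY_TO_STOP_MS = 50 };

typedef struct {
    int downtime_ms; /* time spent while the guest is paused */
    int cleanup_ms;  /* time spent after migration has completed */
} ToyTimeline;

/* Model of the old flow: the transition runs in the complete-precopy step. */
static ToyTimeline toy_transition_in_complete_precopy(void)
{
    ToyTimeline t = { 0, 0 };
    t.downtime_ms += TOY_SEND_FINAL_DATA_MS;
    t.downtime_ms += TOY_STOP_COPY_TO_STOP_MS; /* counted as downtime */
    return t;
}

/* Model of the new flow: the transition is deferred to the cleanup step. */
static ToyTimeline toy_transition_in_cleanup(void)
{
    ToyTimeline t = { 0, 0 };
    t.downtime_ms += TOY_SEND_FINAL_DATA_MS;
    t.cleanup_ms += TOY_STOP_COPY_TO_STOP_MS;  /* no longer downtime */
    return t;
}
```

With these assumed numbers, the modeled downtime drops from 80 ms to 30 ms, while the 50 ms transition cost moves into the cleanup phase.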
What bothers me is that this looks like a device-specific optimization
True, currently it helps mlx5, but this change is based on the assumption that,
in general, VFIO devices are likely to free resources when transitioning from
STOP_COPY to STOP.
So I think this is a good change to have in any case.
and we are losing the error part.
I don't think we lose the error part.
AFAIU, the crucial part is transitioning to STOP_COPY and sending the final
data.
If that's done successfully, then migration is successful.
The STOP_COPY->STOP transition is done as part of the cleanup flow, after the
migration is completed -- i.e., failure in it does not affect the success of
migration.
Furthermore, if there is an error in the STOP_COPY->STOP transition, then it's
reported by vfio_migration_set_state().
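The argument above can be illustrated with a small model of the two phases. This is a sketch under assumptions, not the actual QEMU flow: `toy_migrate()` and `ToyOutcome` are hypothetical names, and the model only captures the claim that the migration result is decided when the final data is sent, while a later failure in the cleanup transition is reported without changing that result.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool migration_succeeded;    /* decided in the completion phase */
    bool cleanup_error_reported; /* set if the deferred transition fails */
} ToyOutcome;

/*
 * Hypothetical model: final_data_sent_ok stands for the STOP_COPY phase
 * and final data transfer, stop_transition_ok for STOP_COPY->STOP.
 */
static ToyOutcome toy_migrate(bool final_data_sent_ok, bool stop_transition_ok)
{
    ToyOutcome out = { false, false };

    /* Completion phase: this alone determines the migration result. */
    out.migration_succeeded = final_data_sent_ok;

    /* Cleanup phase: runs afterwards; a failure is reported, not hidden. */
    if (!stop_transition_ok) {
        fprintf(stderr, "toy: STOP_COPY->STOP transition failed\n");
        out.cleanup_error_reported = true;
    }
    return out;
}
```

In this model, `toy_migrate(true, false)` yields a successful migration together with a reported cleanup error, which is the case being discussed.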
It is indeed. I am nit-picking. Pushed to:
https://github.com/legoater/qemu/tree/vfio-next
It can still be updated before I send a PR. I also provided custom
rpms to our QE team for extra tests.
Dynamic MSI-X allocation [1] and Joao's series regarding vIOMMU [2] should
follow, but first I will take some PTO. See you in a couple of weeks!
Cheers,
C.
[1] https://lore.kernel.org/qemu-devel/20230727072410.135743-1-jing2....@intel.com/
[2] https://lore.kernel.org/qemu-devel/20230622214845.3980-1-joao.m.mart...@oracle.com/