On 04/10/2023 18:10, Peter Xu wrote: > Hi, Joao, > > On Tue, Sep 26, 2023 at 05:18:40PM +0100, Joao Martins wrote: >> Deliver the downtime breakdown also via `query-migrate` >> to allow users to understand what their downtime value >> represents. > > I agree downtime is an area we definitely need to improve.. however do we > need to make it part of qapi? Or do we need them mostly for debugging > purpose? > > Any introduction of motivation of this work, especially on exporting the > values to the mgmt apps? >
I added the statistics mainly for observability (e.g. you would grep in the libvirt logs for a non developer and they can understand how downtime is explained). I wasn't specifically thinking about management app using this, just broad access to the metrics. One can get the same level of observability with a BPF/dtrace/systemtap script, albeit in a less obvious way. With respect to motivation: I am doing migration with VFs and sometimes vhost-net, and the downtime/switchover is the only thing that is either non-determinisc or not captured in the migration math. There are some things that aren't accounted (e.g. vhost with enough queues will give you high downtimes), and algorithimally not really possible to account for as one needs to account every possible instruction when we quiesce the guest (or at least that's my understanding). Just having these metrics, help the developer *and* user see why such downtime is high, and maybe open up window for fixes/bug-reports or where to improve. Furthermore, hopefully these tracepoints or stats could be a starting point for developers to understand how much downtime is spent in a particular device in Qemu(as a follow-up to this series), or allow to implement bounds check limits in switchover limits in way that doesn't violate downtime-limit SLAs (I have a small set of patches for this). >> >> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com> >> --- >> qapi/migration.json | 22 ++++++++++++++++++++++ >> migration/migration.c | 14 ++++++++++++++ >> 2 files changed, 36 insertions(+) >> >> diff --git a/qapi/migration.json b/qapi/migration.json >> index 2d91fbcb22ff..088e1b2bf440 100644 >> --- a/qapi/migration.json >> +++ b/qapi/migration.json >> @@ -217,6 +217,27 @@ >> 'data': [ 'start', 'stop', 'precopy-iterable', 'precopy-noniterable', >> 'resume-return-path' ] } >> >> +## >> +# @DowntimeStats: >> +# >> +# Detailed migration downtime statistics. >> +# >> +# @stop: Time taken to stop the VM during switchover. >> +# >> +# @precopy: Time taken to save all precopy state during switchover. >> +# >> +# @precopy-iterable: Time taken to save all precopy iterable state. >> +# >> +# @precopy-noniterable: Time taken to save all precopy non iterable state. >> +# >> +# @resume-return-path: Time taken to resume if return path is enabled, >> +# otherwise zero. > > All these fields will more or less duplicate the ones in the other > MigrationDowntime just introduced. > > We suffer from duplicated fields for migration parameters, proof shows that > dropping the duplication is normally harder.. I'm trying to dedup the > existing Migration*Parameter* objects and still in progress, so even if we > want to expose downtime in qapi I hope we can expose only one and single > object. > Thanks for the background; I am now recalling your other series doing this sort of duplication[0] [0] https://lore.kernel.org/qemu-devel/20230905162335.235619-5-pet...@redhat.com/ > IIUC we can already do that by keeping DowntimeStats, keeing an object in > MigrationState, and just drop MigrationDowntime? > I can try that, sounds way cleaner. I didn't like the duplication either.