postcopy: Report fault latencies in blocktime

Peter Xu Mon, 02 Jun 2025 09:30:17 -0700

On Mon, Jun 02, 2025 at 11:26:36AM +0200, Markus Armbruster wrote:
> Peter Xu <pet...@redhat.com> writes:
> 
> > Blocktime so far only cares about the time one vcpu (or the whole system)
> > got blocked.  It would also be helpful if it could report the latency
> > of page requests, which could be very sensitive during postcopy.
> >
> > Blocktime itself is sometimes not very important, especially when one
> > thinks about KVM async PF support, which means vCPUs are literally almost
> > not blocked at all because the guest OS is smart enough to switch to
> > another task when a remote fault is needed.
> >
> > However, latency is still sensitive and important because even if the guest
> > vCPU is running on threads that do not need a remote fault, the workload
> > that accesses some missing page is still affected.
> >
> > Add two entries to the report, showing how long it takes to resolve a
> > remote fault.  Mention in the QAPI doc that this is not the real average
> > fault latency, but only the ones that were requested for a remote fault.
> >
> > Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
> > meanwhile add the entry checks in qtests for all postcopy tests.
> >
> > Cc: Markus Armbruster <arm...@redhat.com>
> > Cc: Dr. David Alan Gilbert <d...@treblig.org>
> > Signed-off-by: Peter Xu <pet...@redhat.com>
> > ---
> >  qapi/migration.json                   | 13 +++++
> >  migration/migration-hmp-cmds.c        | 70 ++++++++++++++++++---------
> >  migration/postcopy-ram.c              | 48 ++++++++++++------
> >  tests/qtest/migration/migration-qmp.c |  3 ++
> >  4 files changed, 97 insertions(+), 37 deletions(-)
> >
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 8b9c53595c..8b13cea169 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -236,6 +236,17 @@
> >  #     This is only present when the postcopy-blocktime migration
> >  #     capability is enabled.  (Since 3.0)
> >  #
> > +# @postcopy-latency: average remote page fault latency (in us).  Note that
> > +#     this doesn't include all faults, but only the ones that require a
> > +#     remote page request.  So it should be always bigger than the real
> > +#     average page fault latency. This is only present when the
> > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> > +#
> > +# @postcopy-vcpu-latency: average remote page fault latency per vCPU (in
> > +#     us).  It has the same definition of @postcopy-latency, but instead
> > +#     this is the per-vCPU statistics. This is only present when the
> 
> Two spaces between sentences for consistency, please.


Fixed.  There's another similar occurrence in the last patch, I'll fix that
too.

> 
> > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> 
> I figure the @i-th array element is for vCPU with index @i.  Correct?
> 
> This is also only present when @postcopy-blocktime is enabled.  Correct?

Correct on both.

> 
> Could a QMP client compute @postcopy-latency from
> @postcopy-vcpu-latency?

Not with the current API.

Right now, the report carries the per-vCPU average latencies and the global
average latency, but not the per-vCPU fault counts.  Those counts would be
needed to do the calculation.
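
To illustrate why the counts matter, here is a minimal sketch of what a
client would have to do.  It assumes the global figure is the total latency
divided by the total number of remote faults; the vcpu_fault_count input is
hypothetical (the report does not export it) and the numbers are made up.

```python
from typing import Sequence


def global_avg_latency(vcpu_avg_us: Sequence[float],
                       vcpu_fault_count: Sequence[int]) -> float:
    """Combine per-vCPU average latencies (us) into one global average.

    vcpu_fault_count is a hypothetical input: the current report does
    not export per-vCPU remote fault counts.
    """
    if len(vcpu_avg_us) != len(vcpu_fault_count):
        raise ValueError("need exactly one fault count per vCPU")
    total_faults = sum(vcpu_fault_count)
    if total_faults == 0:
        return 0.0
    # Each vCPU's average is weighted by how many remote faults it took.
    total_latency = sum(a * n for a, n in zip(vcpu_avg_us, vcpu_fault_count))
    return total_latency / total_faults


# Made-up numbers: vCPU 0 took 1 fault at 1000 us, vCPU 1 took 99 at 100 us.
avgs = [1000.0, 100.0]
counts = [1, 99]
weighted = global_avg_latency(avgs, counts)  # uses the fault counts
unweighted = sum(avgs) / len(avgs)           # all a client can do today
```

With these inputs the two results differ widely, so a plain mean of the
per-vCPU entries is not a substitute for the global value.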

I chose to export the global average latency simply because it is the most
important one to me as of now.  The per-vCPU results are pretty much a side
effect of how the blocktime feature does its accounting so far (which is
per-vCPU), so they are very low-hanging fruit.

Thanks,

-- 
Peter Xu

Re: [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime
