On Thu, Dec 18, 2025 at 05:20:19PM +0800, Chuang Xu wrote:
> On 17/12/2025 22:59, Peter Xu wrote:
> > Right, it will, because any time used for sync has the vCPUs running, so
> > that will contribute to the total dirtied pages, hence partly increasing D,
> > as you pointed out.
> >
> > But my point is, if you _really_ have R=B, then e.g. on a 10Gbps NIC you
> > should see R~=10Gbps.  If R is not wire speed, it means R is not really
> > being measured correctly..
> 
> In my experience, the bandwidth of live migration usually doesn't reach
> the NIC's bandwidth limit (my test environment's NIC bandwidth limit is
> 200Gbps).  This could be due to various reasons: for example, the live
> migration main thread's ability to search for dirty pages may have
> reached a bottleneck; the NIC's interrupt binding range might limit the
> softirq's processing capacity; there might be too few multifd threads;
> or there might be overhead in synchronizing between the live migration
> main thread and the multifd threads.

Exactly, especially when you have 200Gbps NICs.

I wish I had some of those for testing too!  I don't, so I can't provide
really useful input..  My vague memory (I got a chance to use a 100Gbps
NIC, if I recall correctly) is that the main thread already bottlenecks
there, where I had (maybe?) 8 multifd threads.

I just never knew whether we need to scale it out yet; so far, a 100G/200G
setup normally only happens with direct attachment, not a major use case for
cluster setups?  Or maybe I am outdated?

If that'll be a major use case at some point, and if the main thread is the
bottleneck distributing things, then we need to scale it out.  I think it's
doable.

> 
> >
> > I think it's likely impossible to measure R so precisely that it'll equal
> > B, however IMHO we can still think about something that makes R get much
> > closer to B; then, when y is a constant (default 300ms, for example), it'll
> > start to converge where it previously couldn't.
> 
> Yes, there are always various factors that can cause measurement errors.
> We can only try to make the calculated value as close as possible to the 
> actual value.
> 
> > E.g. QEMU can currently report R as low as 10Mbps even on a 10Gbps NIC.
> > IMHO it'll be much better, and will start solving a lot of such problems,
> > if it can start to report at least a few Gbps based on all kinds of methods
> > (e.g. excluding sync time, as you experimented); then even if it's not
> > reporting 10Gbps it'll help.
> >
> After I applied these optimizations, the bandwidth statistics from QEMU
> and the real-time NIC bandwidth monitored by atop are typically close.
> 
> Those extremely low bandwidth values (still consistent with atop monitoring)
> are usually caused by zero pages or dirty pages with extremely high
> compression ratios.  In these cases, QEMU uses very little NIC bandwidth to
> transmit a large number of dirty pages, but the bandwidth is calculated only
> from the actual amount of data transmitted.

Yes.  That's a major issue in QEMU: zero pages / compressed pages / ... not
only affect how QEMU "measures" the mbps, but also affect how QEMU decides
when to converge.  Here I'm not talking about the bw difference making
"bw * downtime_limit" [A] too small; I'm talking about the other side of the
equation, where we compare [A] with "remain_dirty_pages * psize" [B].  In
reality, [B] isn't accurate either when zero pages / compressed pages / ...
are used..

Maybe.. the switchover decision shouldn't use MBps as the unit, but "number
of pages".  That would remove most of those effects at least, but it needs
some more consideration..
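
To make the two sides concrete, here is a rough sketch (simplified, with
made-up names, not the actual migration code) of the shape of the check, and
where zero / compressed pages skew both sides:

    #include <stdbool.h>
    #include <stdint.h>

    /* Rough sketch only -- not the real QEMU logic, just its shape. */
    static bool should_switch_over(uint64_t wire_bytes_sent,   /* last window */
                                   uint64_t time_spent_ms,
                                   uint64_t remain_dirty_pages,
                                   uint64_t page_size,
                                   uint64_t downtime_limit_ms)
    {
        /* Skewed: zero / compressed pages mean wire bytes != guest pages. */
        uint64_t bandwidth = wire_bytes_sent / time_spent_ms;   /* bytes/ms */
        uint64_t threshold = bandwidth * downtime_limit_ms;     /* [A] */
        /* Also skewed: many of these pages won't cost page_size on the wire. */
        uint64_t pending = remain_dirty_pages * page_size;      /* [B] */
        return pending < threshold;
    }

Counting pages on both sides would cancel the wire-bytes-vs-guest-pages
mismatch out of the comparison, but it doesn't help when the per-page cost
changes over time, which is what you describe below.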

> 
> If we want to use the actual number of dirty pages transmitted to calculate
> bandwidth, we face another risk: if the dirty pages transmitted before the
> downtime have a high compression ratio, and the dirty pages to be transmitted
> after the downtime have a low compression ratio, then the downtime will far
> exceed expectations.

... like what you mentioned here, which will also be an issue if we switch to
using n_pages to do the math. :)
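
To put some purely made-up numbers on that risk (hypothetical NIC and
compression figures, only to show the scale of the error):

    measured window:  8 GB of guest pages sent in 1s, mostly zero pages,
                      so only ~0.5 GB actually hit the wire
                      -> page-based "bandwidth" = 8 GB/s
    threshold [A]:    8 GB/s * 300 ms = 2.4 GB worth of pages
    remaining [B]:    2 GB of pages -> check passes, expected downtime ~250 ms
    stop-copy:        if those 2 GB are incompressible, they really cost ~2 GB
                      on a 10Gbps (~1.25 GB/s) wire -> ~1.6 s of downtime,
                      more than 5x the limit the check thought it satisfied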

> 
> This may have strayed a bit, but just providing some potentially useful
> information from my perspective.

Not really; the patch alone is good, and I appreciate the discussion.

Thanks,

-- 
Peter Xu

