On 6/26/25 8:08 AM, Markus Armbruster wrote:
Jonah Palmer <jonah.pal...@oracle.com> writes:

On 6/2/25 4:29 AM, Markus Armbruster wrote:
Butterfingers...  let's try this again.

Markus Armbruster<arm...@redhat.com> writes:

Si-Wei Liu<si-wei....@oracle.com> writes:

On 5/26/2025 2:16 AM, Markus Armbruster wrote:
Si-Wei Liu<si-wei....@oracle.com> writes:

On 5/15/2025 11:40 PM, Markus Armbruster wrote:
Jason Wang<jasow...@redhat.com> writes:

On Thu, May 8, 2025 at 2:47 AM Jonah Palmer<jonah.pal...@oracle.com> wrote:
Current memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped, and before the workload is resumed at the destination.  This is a
period where neither traffic can flow, nor the VM workload can continue
(downtime).

We can do better as we know the memory layout of the guest RAM at the
destination from the moment that all devices are initialized.  So
moving that operation allows QEMU to communicate the maps to the kernel
while the workload is still running in the source, so Linux can start
mapping them.

As a small drawback, there is a time in the initialization where QEMU
cannot respond to QMP etc.  By some testing, this time is about
0.2 seconds.
Adding Markus to see if this is a real problem or not.
I guess the answer is "depends", and to get a more useful one, we need
more information.

When all you care about is the time from executing qemu-system-FOO to the
guest finishing booting, and the guest takes 10s to boot, then an extra
0.2s won't matter much.
There's no such delay of an extra 0.2s or higher per se; it's just shifting
the page pinning hiccup, whether it is 0.2s or something else, from the time
the guest boots up to before the guest is booted. This saves guest boot time
or start-up delay, but in turn the same delay is effectively charged to VM
launch time. We follow the same model as VFIO, which sees the same hiccup
during launch (at an early stage that no real mgmt software would care
about).

When a management application runs qemu-system-FOO several times to
probe its capabilities via QMP, then even milliseconds can hurt.

Not quite; this page pinning hiccup occurs only once, in the very early stage
of launching QEMU, i.e. there's no consistent delay every time QMP is called.
The delay in QMP response at that point depends on how much memory the VM
has, but it is specific to VMs with VFIO or vDPA devices that have to pin
memory for DMA. That said, there's no extra delay at all if the QEMU args
have no vDPA device assignment; on the other hand, there's the same delay or
QMP hiccup when VFIO is in the QEMU args.

In what scenarios exactly is QMP delayed?
That said, this is not a new problem for QEMU in particular; this QMP delay
is not peculiar to vDPA, it exists with VFIO as well.

In what scenarios exactly is QMP delayed compared to before the patch?

The page pinning process now runs in a pretty early phase of
qemu_init(), e.g. machine_run_board_init(),

It runs within

      qemu_init()
          qmp_x_exit_preconfig()
              qemu_init_board()
                  machine_run_board_init()

Except when --preconfig is given, it instead runs within QMP command
x-exit-preconfig.

Correct?

before any QMP command can be serviced; the latter typically only gets to
run from qemu_main_loop(), once the AIO context gets a chance to be polled
and the command dispatched to a bottom half.

We create the QMP monitor within qemu_create_late_backends(), which runs
before qmp_x_exit_preconfig(), but commands get processed only in the
main loop, which we enter later.

Correct?

Technically it's not a real delay for a specific QMP command, but rather
an extended span of the initialization process that may take place before
the very first QMP request, usually qmp_capabilities, is serviced. It's
natural for mgmt software to expect an initialization delay for the first
qmp_capabilities response if it has to issue one immediately after
launching QEMU; especially with a large guest that has hundreds of GBs of
memory and a passthrough device that has to pin memory for DMA, e.g. VFIO,
the delayed effect from the QEMU initialization process is very visible
too.

The work clearly needs to be done.  Whether it needs to be blocking
other things is less clear.

Even if it doesn't need to be blocking, we may choose not to avoid
blocking for now.  That should be an informed decision, though.

All I'm trying to do here is understand the tradeoffs, so I can give
useful advice.

                                              On the other hand, before
the patch, if memory happens to be in the middle of being pinned, any
ongoing QMP request can't be serviced by the QEMU main loop, either.

When exactly does this pinning happen before the patch?  In which
function?

Before the patches, the memory listener was registered in
vhost_vdpa_dev_start(), well after device initialization.

And by device initialization here I mean the
qemu_create_late_backends() function.

With these patches, the memory listener is now being
registered in vhost_vdpa_set_owner(), called from
vhost_dev_init(), which is part of the device
initialization phase.

However, even though the memory_listener_register() is
called during the device initialization phase, the actual
pinning happens (very shortly) after
qemu_create_late_backends() returns (due to RAM being
initialized later).
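
To make the before/after concrete, here is a heavily simplified,
excerpt-style sketch of where the registration moves. This is my own
paraphrase of the idea, not the literal patch; field and helper names
(v->listener, vhost_vdpa_call(), etc.) may differ from the actual tree,
and the snippets won't build on their own:

    /* Before the patches (simplified): the listener -- and the pinning
     * it triggers -- is only registered when the device starts, i.e.
     * from the main loop. */
    static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
    {
        struct vhost_vdpa *v = dev->opaque;

        if (started) {
            /* region_add callbacks fire -> DMA maps -> page pinning */
            memory_listener_register(&v->listener, &address_space_memory);
        }
        /* ... rest of the start path ... */
        return 0;
    }

    /* After the patches (simplified): registration happens while the
     * owner is set during vhost_dev_init(), i.e. still inside
     * qemu_init(). */
    static int vhost_vdpa_set_owner(struct vhost_dev *dev)
    {
        struct vhost_vdpa *v = dev->opaque;
        int r = vhost_vdpa_call(dev, VHOST_SET_OWNER, NULL);

        if (r == 0) {
            /* Guest RAM is mapped (and pinned) as regions are added. */
            memory_listener_register(&v->listener, &address_space_memory);
        }
        return r;
    }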

---

So, without these patches, and based on my measurements,
memory pinning starts ~2.9s after qemu_create_late_backends()
returns.

With these patches, memory pinning starts ~0.003s after
qemu_create_late_backends() returns.

So, we're registering the memory listener earlier, which makes it do its
expensive work (pinning) earlier ("very shortly after
qemu_create_late_backends()").  I still don't understand where exactly
the pinning happens (where at runtime and where in the code).  Not sure
I have to.


Apologies for the delay in getting back to you. I just wanted to be thorough and answer everything as accurately and clearly as possible.

----

Before these patches, pinning started in vhost_vdpa_dev_start(), where the memory listener was registered; registering it triggers the vhost_vdpa_listener_region_add() callbacks that perform the actual memory pinning. This happens after entering qemu_main_loop().

After these patches, pinning starts in vhost_dev_init() (specifically vhost_vdpa_set_owner()), where the memory listener registration was moved. This happens *before* entering qemu_main_loop().

However, not all of the pinning happens before qemu_main_loop(). The pinning that happens before we enter qemu_main_loop() is the full guest RAM pinning, which is the main, heavy-lifting part of the work.

The rest of the pinning work happens after entering qemu_main_loop() (at approximately the same point at which pinning started before these patches). But since we already did the heavy lifting of the pinning work pre qemu_main_loop() (i.e. all pages were already allocated and pinned), we're just re-pinning here: the kernel just updates its IOTLB tables for pages that are already mapped and locked in RAM.

This makes the pinning work we do after entering qemu_main_loop() much faster than the same pinning we had to do before these patches.

However, we pay a cost for this. Because we do the heavy lifting earlier, pre qemu_main_loop(), we're pinning cold memory. That is, the guest hasn't touched its memory yet; all host pages are still anonymous and unallocated. This means that doing the pinning earlier is more expensive time-wise, since we also need to allocate (fault in) physical pages for each chunk of memory.
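
As a toy illustration of the cold-vs-warm effect outside of QEMU (my own standalone sketch, not part of the patches), the program below times mlock() on an untouched anonymous mapping versus the same mapping after pre-touching it. The cold case has to pay the page allocation / fault cost that we now pay earlier in QEMU. It needs a sufficiently large RLIMIT_MEMLOCK (or root) to run:

    /* warm_vs_cold_pin.c -- toy demo, build with: cc -O2 -o pin warm_vs_cold_pin.c */
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <time.h>
    #include <sys/mman.h>

    #define SZ (1UL << 30)                  /* 1 GiB per region */

    static double secs(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    static void *map_region(void)
    {
        void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        return p;
    }

    int main(void)
    {
        struct timespec t0, t1;

        /* Cold: pages are anonymous and not yet allocated, so mlock()
         * has to fault them all in before it can pin them. */
        void *cold = map_region();
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (mlock(cold, SZ)) { perror("mlock cold"); exit(1); }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("cold pin: %.3fs\n", secs(t0, t1));

        /* Warm: touch every page first so it's already resident, then pin. */
        void *warm = map_region();
        memset(warm, 1, SZ);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (mlock(warm, SZ)) { perror("mlock warm"); exit(1); }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("warm pin: %.3fs\n", secs(t0, t1));

        return 0;
    }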

To (hopefully) show this more clearly, I ran some tests before and after these patches and averaged the results. I used a 50G guest with real vDPA hardware (Mellanox CX-6Dx):

0.) How many vhost_vdpa_listener_region_add() (pins) calls?

               | Total | Before qemu_main_loop | After qemu_main_loop
---------------|-------|-----------------------|---------------------
Before patches |   6   |           0           |          6
---------------|-------|-----------------------|---------------------
After patches  |  11   |           5           |          6

- After the patches, it looks like we doubled the work we're doing (given the extra 5 calls); however, the 6 calls that happen after entering qemu_main_loop() are essentially replays of the first 5 we did.

* In other words, after the patches, the 6 calls made after entering qemu_main_loop() are performed much faster than the same 6 calls before the patches.

* From my measurements, these are the timings it took to perform those 6 calls after entering qemu_main_loop():
   > Before patches: 0.0770s
   > After patches:  0.0065s

---

1.) Time from starting the guest to entering qemu_main_loop():
 * Before patches: 0.112s
 * After patches:  3.900s

- This is due to the 5 early pins we're now doing with these patches, whereas before no pinning work happened during this period at all.

- From measuring the time between the first and last vhost_vdpa_listener_region_add() calls during this period, this comes out to ~3s for the early pinning.

I'd also like to highlight that without this patch, the fairly high
delay due to page pinning is visible to the guest as well, in addition
to the QMP delay, and it already largely affects guest boot time with a
vDPA device. It is long standing, and every VM user with a vDPA device
would like to avoid such a high delay for the first boot, which is not
seen with similar devices, e.g. VFIO passthrough.

I understand that hiding the delay from the guest could be useful.

Thanks,
-Siwei

You told us an absolute delay you observed.  What's the relative delay,
i.e. what's the delay with and without these patches?

Can you answer this question?

I thought I already answered that in an earlier reply. The relative
delay depends on the size of memory. Usually mgmt software won't be
able to notice unless the guest has more than 100GB of THP memory to
pin, for DMA or whatever reason.

Alright, what are the delays you observe with and without these patches
for three test cases that pin 50 / 100 / 200 GiB of THP memory
respectively?

So with THP memory specifically, these are my measurements below.
For these measurements, I simply started up a guest, traced the
vhost_vdpa_listener_region_add() calls, and found the difference
in time between the first and last calls. In other words, this is
roughly the time it took to pin all of guest memory. I did 5 runs
for each memory size:

Before patches:
===============
50G:   7.652s,  7.992s,  7.981s,  7.631s,  7.953s (Avg.  7.841s)
100G:  8.990s,  8.656s,  9.003s,  8.683s,  8.669s (Avg.  8.800s)
200G: 10.705s, 10.841s, 10.816s, 10.772s, 10.818s (Avg. 10.790s)

After patches:
==============
50G:  12.091s, 11.685s, 11.626s, 11.952s, 11.656s (Avg. 11.802s)
100G: 14.121s, 14.079s, 13.700s, 14.023s, 14.130s (Avg. 14.010s)
200G: 18.134s, 18.350s, 18.387s, 17.800s, 18.401s (Avg. 18.214s)

The reason we're seeing a jump here may be that with the memory
pinning happening earlier, the pinning happens before QEMU has
fully faulted in the guest's RAM.

As far as I understand, before these patches, by the time we
reached vhost_vdpa_dev_start(), all pages were already resident
(and THP splits already happened with the prealloc=on step), so
get_user_pages() pinned "warm" pages much faster.

With these patches, the memory listener is running on cold memory.
Every get_user_pages() call would fault in its 4KiB subpage (and
if THP was folded, split a 2MiB hugepage) before handing back a
'struct page'.

Let's see whether I understand...  Please correct my mistakes.

Memory pinning takes several seconds for large guests.

Your patch makes pinning much slower.  You're theorizing this is because
pinning cold memory is slower than pinning warm memory.

I suppose the extra time is saved elsewhere, i.e. the entire startup
time remains roughly the same.  Have you verified this experimentally?


Based on the measurements I did, we pay a ~3s increase in initialization time (pre qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier for a vhost-vDPA device. This resulted in:

* Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).

* Shorter downtime phase during live migration (see below).

* Slight increase in time for the device to be operational (e.g. guest sets DRIVER_OK).
  > This measured the time from guest start to the guest setting DRIVER_OK for the device:

    Before patches: 22.46s
    After patches:  23.40s

The real timesaver here is the guest-visible downtime during live migration (when using a vhost-vDPA device). Since the heavy lifting of the memory pinning is done during the initialization phase, it's no longer part of the stop-and-copy phase, which results in a much shorter guest-visible downtime.

From v5's CV:

Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
the series reduces guest-visible downtime during back-to-back live
migrations by more than half:
- 39G VM:   4.72s -> 2.09s (-2.63s, ~56% improvement)
- 128G VM:  14.72s -> 5.83s (-8.89s, ~60% improvement)

Essentially, we pay a slightly increased startup-time tax to buy ourselves a much shorter downtime window when we want to perform a live migration with a vhost-vDPA networking device.

Your stated reason for moving the pinning is moving it from within
migration downtime to before migration downtime.  I understand why
that's useful.

You mentioned "a small drawback [...] a time in the initialization where
QEMU cannot respond to QMP".  Here's what I've been trying to figure out
about this drawback since the beginning:

* Under what circumstances is QMP responsiveness affected?  I *guess*
   it's only when we start a guest with plenty of memory and a certain
   vhost-vdpa configuration.  What configuration exactly?


Regardless of these patches, as I understand it, QMP cannot actually run any command that requires the BQL while we're pinning memory (the memory pinning path needs to take the lock).

However, the BQL is not held for the entirety of the pinning process; it's periodically released throughout. But those windows are *very* short and can only be caught if you're hammering QMP with commands very rapidly.

From a realistic point of view, it's more practical to think of QMP being fully ready once all pinning has finished, e.g. time_spent_memory_pinning ≈ time_QMP_is_blocked.

---

As I understand it, QMP is not fully ready and cannot service requests until early on in qemu_main_loop().

Given that these patches increase the time it takes to reach qemu_main_loop() (due to the early pinning work), QMP will also be delayed by that amount of time.

I created a test that hammers QMP with commands until QEMU is able to properly service a request, and recorded how long it took from guest start until the request was fulfilled:
 * Before patches: 0.167s
 * After patches:  4.080s

This aligns with the time measured to reach qemu_main_loop() and the time we spend doing the early memory pinning.

All in all, the larger the amount of memory we need to pin, the longer it will take for us to reach qemu_main_loop(), the larger time_spent_memory_pinning will be, and thus the longer it will take for QMP to be ready and fully functional.
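
For what it's worth, here is a rough standalone sketch of the kind of probe I used (my own code, simplified; the socket path is an assumption -- start QEMU with e.g. -qmp unix:/tmp/qmp.sock,server=on,wait=off and launch the probe alongside it). It retries the connection, completes the capabilities handshake, issues query-status, and reports how long each step took from the probe's start:

    /* qmp_ready_probe.c -- build with: cc -O2 -o qmp_ready_probe qmp_ready_probe.c */
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <time.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Keep reading until the reply received so far contains `needle`. */
    static void wait_for(int fd, const char *needle)
    {
        char buf[65536];
        size_t used = 0;

        for (;;) {
            ssize_t n = read(fd, buf + used, sizeof(buf) - used - 1);
            if (n <= 0) {
                perror("read");
                exit(1);
            }
            used += n;
            buf[used] = '\0';
            if (strstr(buf, needle)) {
                return;
            }
        }
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/tmp/qmp.sock";
        const char *caps = "{\"execute\": \"qmp_capabilities\"}\n";
        const char *stat = "{\"execute\": \"query-status\"}\n";
        double start = now();
        int fd;

        /* Retry until QEMU has created the monitor socket and accepts us. */
        for (;;) {
            struct sockaddr_un sa = { .sun_family = AF_UNIX };

            fd = socket(AF_UNIX, SOCK_STREAM, 0);
            if (fd < 0) { perror("socket"); exit(1); }
            strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
            if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0) {
                break;
            }
            close(fd);
            usleep(1000);
        }

        wait_for(fd, "\"QMP\"");            /* greeting (flushed early)     */
        double greeted = now();

        if (write(fd, caps, strlen(caps)) < 0) { perror("write"); exit(1); }
        wait_for(fd, "\"return\"");         /* handshake actually processed */
        if (write(fd, stat, strlen(stat)) < 0) { perror("write"); exit(1); }
        wait_for(fd, "\"status\"");         /* query-status answered        */

        printf("greeting after %.3fs, first command answered after %.3fs\n",
               greeted - start, now() - start);
        close(fd);
        return 0;
    }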

----

I don't believe this is related to any specific vhost-vDPA configuration. I think the bottom line is that if we're using a vhost-vDPA device, we'll spend more time reaching qemu_main_loop(), so QMP has to wait until we get there.

* How is QMP responsiveness affected?  Delay until the QMP greeting is
   sent?  Delay until the first command gets processed?  Delay at some
   later time?


Responsiveness: Longer initial delay, due to the early pinning work we need to do before we can bring QMP up.

Greeting delay: No greeting delay. Greeting is flushed earlier, even before we start the early pinning work.

* For both before and after the patches, this was ~0.052s for me.

Delay until first command processed: Longer initial delay at startup.

Delay at later time: None.

* What's the absolute and the relative time of QMP non-responsiveness?
   0.2s were mentioned.  I'm looking for something like "when we're not
   pinning, it takes 0.8s until the first QMP command is processed, and
   when we are, it takes 1.0s".


The numbers below are based on my recent testing and measurements. This was with a 50G guest with real vDPA hardware.

Before patches:
---------------
* From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 0.167s.
  > This timing is pretty much the same regardless of whether or not we're pinning memory.

* Time spent pinning memory (QMP cannot handle requests during this window): 0.077s.

After patches:
--------------
* From the start time of the guest to the earliest time QMP is able to process a request (e.g. query-status): 4.08s
  > If we're not early pinning memory, it's ~0.167s.

* Time spent pinning memory *after entering qemu_main_loop()* (QMP cannot handle requests during this window): 0.0065s.

I believe this to be the case since in my measurements I noticed
some larger time gaps (fault + split overhead) in between some of
the vhost_vdpa_listener_region_add() calls.

However I'm still learning some of these memory pinning details,
so please let me know if I'm misunderstanding anything here.

Thank you!

[...]


