On 7/4/25 11:00 AM, Markus Armbruster wrote:
Jonah Palmer <jonah.pal...@oracle.com> writes:
On 6/26/25 8:08 AM, Markus Armbruster wrote:
[...]
Apologies for the delay in getting back to you. I just wanted to be thorough
and answer everything as accurately and clearly as possible.
----
Before these patches, pinning started in vhost_vdpa_dev_start(), where the
memory listener was registered; registering the listener is what triggers the
vhost_vdpa_listener_region_add() calls that do the actual memory pinning. This
happens after entering qemu_main_loop().
After these patches, pinning starts in vhost_dev_init() (specifically
vhost_vdpa_set_owner()), which is where the memory listener registration was
moved to. This happens *before* entering qemu_main_loop().
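To make the relocation a bit more concrete, here's a tiny stand-alone toy model
of the idea (this is NOT QEMU code; every name below is a stand-in made up for
illustration). The point is just that registering the listener is what replays
the existing RAM sections through the region_add callback, so moving the
registration earlier moves the pinning earlier too:

/* Toy model of the relocation described above -- not QEMU code; all
 * names are stand-ins invented for illustration. In the real code the
 * listener's region_add callback is what ends up pinning each section. */
#include <stdio.h>

typedef struct {
    /* Invoked once per existing RAM section when the listener is
     * registered (and again for any section added later). */
    void (*region_add)(const char *section);
} ToyMemoryListener;

static void toy_region_add(const char *section)
{
    printf("pinning %s\n", section);  /* stand-in for the DMA map + pin */
}

static ToyMemoryListener toy_listener = { .region_add = toy_region_add };

static void toy_listener_register(ToyMemoryListener *l)
{
    /* Registration immediately replays the current RAM sections. */
    l->region_add("guest RAM");
}

static void toy_device_init(void)
{
    /* After the series: register during device init/realize, so the
     * full guest RAM pinning happens before the main loop is entered. */
    toy_listener_register(&toy_listener);
}

static void toy_main_loop(void)
{
    /* Before the series: registration happened at device start, i.e.
     * from inside the main loop, so all of the pinning landed here. */
    printf("main loop running\n");
}

int main(void)
{
    toy_device_init();
    toy_main_loop();
    return 0;
}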
However, not all of the pinning happens before qemu_main_loop(). The pinning
that happens before we enter qemu_main_loop() is the full guest RAM pinning,
which is the main, heavy-lifting part of the work.
The rest of the pinning happens after entering qemu_main_loop() (at roughly the
same point where pinning started before these patches). But since the heavy
lifting was already done before qemu_main_loop() (i.e. all pages were already
allocated and pinned), we're just re-pinning here (the kernel just updates its
IOTLB tables for pages that are already mapped and locked in RAM).
This makes the pinning work we do after entering qemu_main_loop() much faster
than the same pinning was before these patches.
However, we have to pay a cost for this. Because we do the heavy lifting
earlier, before qemu_main_loop(), we're pinning cold memory. That is, the guest
hasn't touched its memory yet; all host pages are still anonymous and
unallocated. This means that doing the pinning earlier is more expensive
time-wise, since we also need to allocate physical pages for each chunk of
memory.
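If it helps to see the cold-memory cost outside of QEMU, here's a small
stand-alone sketch (not QEMU code; sizes and numbers are just for
demonstration): it mlock()s an untouched anonymous mapping, unlocks it, and
locks it again. The first lock has to allocate and fault in every page; the
second one finds the pages already resident and is mostly bookkeeping, which is
the same effect that makes the later re-pinning cheap:

/* Cold vs. warm pinning illustration -- stand-alone sketch, not QEMU code.
 * Build: gcc -O2 -o pin-demo pin-demo.c
 * Note: locking 1 GiB needs a raised RLIMIT_MEMLOCK (ulimit -l). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define SIZE (1UL << 30)   /* 1 GiB, purely for demonstration */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    void *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* "Cold" pin: pages are still anonymous and unallocated, so the
     * kernel has to allocate and fault in every page while pinning. */
    double t0 = now_sec();
    if (mlock(buf, SIZE) != 0) {
        perror("mlock (cold)");
        return 1;
    }
    printf("cold pin: %.3fs\n", now_sec() - t0);

    munlock(buf, SIZE);
    memset(buf, 0x5a, SIZE);   /* the "guest" touches its memory */

    /* "Re-pin": the pages already exist and stay resident, so pinning
     * again is mostly bookkeeping. */
    t0 = now_sec();
    if (mlock(buf, SIZE) != 0) {
        perror("mlock (warm)");
        return 1;
    }
    printf("re-pin:   %.3fs\n", now_sec() - t0);

    munmap(buf, SIZE);
    return 0;
}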
To (hopefully) show this more clearly, I ran some tests before and after these
patches and averaged the results. I used a 50G guest with real vDPA hardware
(Mellanox CX-6Dx):
0.) How many vhost_vdpa_listener_region_add() (pins) calls?
               | Total | Before qemu_main_loop | After qemu_main_loop
---------------+-------+-----------------------+---------------------
Before patches |   6   |           0           |          6
After patches  |  11   |           5           |          6
- After the patches, it looks like we doubled the work we're doing (given the
extra 5 calls). However, the 6 calls that happen after entering
qemu_main_loop() are essentially replays of the first 5 we did.
* In other words, after the patches, the 6 calls made after entering
qemu_main_loop() are performed much faster than the same 6 calls before the
patches.
* From my measurements, these are the timings it took to perform those 6
calls after entering qemu_main_loop():
> Before patches: 0.0770s
> After patches: 0.0065s
---
1.) Time from starting the guest to entering qemu_main_loop():
* Before patches: 0.112s
* After patches: 3.900s
- This is due to the 5 early pins we're doing now with these patches, whereas
before no pinning work happened at all during this phase.
- From measuring the time between the first and last
vhost_vdpa_listener_region_add() calls during this period, this comes out to
~3s for the early pinning.
So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop). Correct?
Correct. We only save ~0.07s from the pinning that happens in the main
loop. But the extra 3s we now need to spend pinning before
qemu_main_loop() overshadows it.
We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime. Correct?
[...]
Correct. Though whether or not we want this trade I suppose is
subjective. But the 50-60% reduction in guest-visible downtime is pretty
nice if we can stomach the initial startup costs.
Let's see whether I understand... Please correct my mistakes.
Memory pinning takes several seconds for large guests.
Your patch makes pinning much slower. You're theorizing this is because
pinning cold memory is slower than pinning warm memory.
I suppose the extra time is saved elsewhere, i.e. the entire startup
time remains roughly the same. Have you verified this experimentally?
Based on my measurements, we pay a ~3s increase in initialization time (before
qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier for
a vhost-vDPA device. This resulted in:
* Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
* Shorter downtime phase during live migration (see below).
* Slight increase in time for the device to be operational (e.g. guest sets
DRIVER_OK).
> This measures the time from guest start to the guest setting DRIVER_OK for
the device:
Before patches: 22.46s
After patches: 23.40s
The real timesaver here is the guest-visible downtime during live migration
(when using a vhost-vDPA device). Since the heavy lifting of the memory pinning
is done during the initialization phase, it's no longer included in the
stop-and-copy phase, which results in a much shorter guest-visible downtime.
From v5's cover letter:
Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
the series reduces guest-visible downtime during back-to-back live
migrations by more than half:
- 39G VM: 4.72s -> 2.09s (-2.63s, ~56% improvement)
- 128G VM: 14.72s -> 5.83s (-8.89s, ~60% improvement)
Essentially, we pay a slightly increased startup-time tax to buy ourselves a
much shorter downtime window when we perform a live migration with a vhost-vDPA
networking device.
Your stated reason for moving the pinning is moving it from within
migration downtime to before migration downtime. I understand why
that's useful.
You mentioned "a small drawback [...] a time in the initialization where
QEMU cannot respond to QMP". Here's what I've been trying to figure out
about this drawback since the beginning:
* Under what circumstances is QMP responsiveness affected? I *guess*
it's only when we start a guest with plenty of memory and a certain
vhost-vdpa configuration. What configuration exactly?
Regardless of these patches, as I understand it, QMP cannot actually run any
command that requires the BQL while we're pinning memory (memory pinning needs
to use the lock).
However, the BQL is not held for the entirety of the pinning process; it's
periodically released throughout. But those windows are *very* short and are
only caught if you're hammering QMP with commands very rapidly.
From a realistic point of view, it's more practical to think of QMP as being
fully ready only once all pinning has finished, i.e. time_spent_memory_pinning
≈ time_QMP_is_blocked.
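As a rough stand-alone model of why so few commands squeeze through those
windows (just a sketch; QEMU's real BQL and pinning code are more involved than
a bare pthread mutex): one thread holds a lock for long chunks of "pinning" and
drops it only briefly between chunks, while a second thread stands in for QMP
dispatch and needs that lock to service each command:

/* Toy model of "lock mostly held, briefly released" -- not QEMU code.
 * Build: gcc -O2 -pthread -o bql-demo bql-demo.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the BQL */
static bool pinning_done = false;
static int commands_serviced = 0;

/* Stand-in for the pinning path: hold the lock for a long chunk of
 * work, release it only for a tiny gap, then grab it again. */
static void *pinner(void *arg)
{
    for (int chunk = 0; chunk < 10; chunk++) {
        pthread_mutex_lock(&big_lock);
        usleep(300 * 1000);              /* 300ms of "pinning" per chunk */
        pthread_mutex_unlock(&big_lock); /* lock is free only for this gap */
    }
    pthread_mutex_lock(&big_lock);
    pinning_done = true;
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Stand-in for QMP dispatch: every command needs the lock.  (The real
 * dispatcher keeps running forever; this one exits so the demo ends.) */
static void *qmp_dispatcher(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&big_lock);
        bool done = pinning_done;
        if (!done) {
            commands_serviced++;         /* a command squeezed into a gap */
        }
        pthread_mutex_unlock(&big_lock);
        if (done) {
            return NULL;
        }
        usleep(1000);                    /* client hammering every ~1ms */
    }
}

int main(void)
{
    pthread_t p, q;
    pthread_create(&p, NULL, pinner, NULL);
    pthread_create(&q, NULL, qmp_dispatcher, NULL);
    pthread_join(p, NULL);
    pthread_join(q, NULL);
    printf("commands serviced during ~3s of 'pinning': %d\n",
           commands_serviced);
    return 0;
}

Hammering every ~1ms for ~3 seconds would get thousands of commands through an
idle lock; here only a handful make it, which is why, practically speaking, QMP
looks blocked for the whole pinning window.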
---
As I understand it, QMP is not fully ready and cannot service requests until
early on in qemu_main_loop().
It's a fair bit more complicated than that, but it'll do here.
Given that these patches increase the time it takes to reach qemu_main_loop()
(due to the early pinning work), this means that QMP will also be delayed by
that amount.
I created a test that hammers QMP with commands until it's able to properly
service the request and recorded how long it took from guest start to when it
was able to fulfill the request:
* Before patches: 0.167s
* After patches: 4.080s
This aligns with the time measured to reach qemu_main_loop() and the time we
spend doing the early memory pinning.
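For reference, a minimal poller along those lines could look like the sketch
below (not the exact test used; the socket path, the choice of query-status,
and the crude string matching on responses are assumptions for illustration).
It assumes QEMU was started alongside the probe with something like
-qmp unix:/tmp/qmp.sock,server=on,wait=off, connects as soon as the socket
accepts, negotiates capabilities, and reports how long it took until
query-status was actually answered:

/* Minimal QMP latency probe -- illustration only, not the exact test.
 * Assumes: -qmp unix:/tmp/qmp.sock,server=on,wait=off on the QEMU side. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define QMP_SOCK "/tmp/qmp.sock"   /* assumption: matches the -qmp option */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void send_str(int fd, const char *s)
{
    if (write(fd, s, strlen(s)) < 0) {
        perror("write");
    }
}

/* Keep reading until the accumulated data contains the given marker. */
static int wait_for(int fd, const char *marker)
{
    char buf[65536];
    size_t used = 0;

    while (used < sizeof(buf) - 1) {
        ssize_t n = read(fd, buf + used, sizeof(buf) - 1 - used);
        if (n <= 0) {
            return -1;
        }
        used += n;
        buf[used] = '\0';
        if (strstr(buf, marker)) {
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, QMP_SOCK, sizeof(addr.sun_path) - 1);

    double start = now_sec();   /* probe launched together with QEMU */
    int fd;

    /* Retry until the QMP socket accepts connections. */
    for (;;) {
        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            break;
        }
        close(fd);
        usleep(1000);
    }

    /* The greeting comes first; capabilities negotiation is required
     * before QMP will accept any other command. */
    wait_for(fd, "\"QMP\"");
    send_str(fd, "{\"execute\": \"qmp_capabilities\"}");
    wait_for(fd, "\"return\"");

    /* Time how long until query-status actually gets answered. */
    send_str(fd, "{\"execute\": \"query-status\"}");
    wait_for(fd, "\"status\"");
    printf("query-status answered %.3fs after probe start\n",
           now_sec() - start);

    close(fd);
    return 0;
}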
All in all, the larger the amount of memory we need to pin, the longer it will
take for us to reach qemu_main_loop(), the larger time_spent_memory_pinning
will be, and thus the longer it will take for QMP to be ready and fully
functional.
----
I don't believe this is related to any specific vhost-vDPA configuration. I
think the bottom line is that if we're using a vhost-vDPA device, we'll spend
more time reaching qemu_main_loop(), so QMP has to wait until we get there.
Let me circle back to my question: Under what circumstances is QMP
responsiveness affected?
The answer seems to be "only when we're using a vhost-vDPA device".
Correct?
Correct, since using one of these devices causes us to do this memory
pinning. If we're not using one, it's business as usual for QEMU.
We're using one exactly when QEMU is running with one of its
vhost-vdpa-device-pci* device models. Correct?
Yeah, or something like:
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
-device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \
* How is QMP responsiveness affected? Delay until the QMP greeting is
sent? Delay until the first command gets processed? Delay at some
later time?
Responsiveness: Longer initial delay due to early pinning work we need to do
before we can bring QMP up.
Greeting delay: No greeting delay. Greeting is flushed earlier, even before we
start the early pinning work.
* For both before and after the patches, this was ~0.052s for me.
Delay until first command processed: Longer initial delay at startup.
Delay at later time: None.
Got it.
* What's the absolute and the relative time of QMP non-responsiveness?
0.2s were mentioned. I'm looking for something like "when we're not
pinning, it takes 0.8s until the first QMP command is processed, and
when we are, it takes 1.0s".
The numbers below are based on my recent testing and measurements. This was
with a 50G guest with real vDPA hardware.
Before patches:
---------------
* From the start time of the guest to the earliest time QMP is able to process
a request (e.g. query-status): 0.167s.
> This timing is pretty much the same regardless of whether or not we're
pinning memory.
* Time spent pinning memory (QMP cannot handle requests during this window):
0.077s.
After patches:
--------------
* From the start time of the guest to the earliest time QMP is able to process
a request (e.g. query-status): 4.08s
> If we're not early pinning memory, it's ~0.167s.
* Time spent pinning memory *after entering qemu_main_loop()* (QMP cannot
handle requests during this window): 0.0065s.
Let me recap:
* No change at all unless we're pinning memory early, and we're doing
that only when we're using a vhost-vDPA device. Correct?
* If we are using a vhost-vDPA device:
- Total startup time (until we're done pinning) increases.
Correct.
- QMP becomes available later.
Correct.
- Main loop behavior improves: less guest-visible downtime, QMP more
responsive (once it's available)
Correct. Though the improvement is modest at best if we put aside the
guest-visible downtime improvement.
This is a tradeoff we want always. There is no need to let users pick
"faster startup, worse main loop behavior."
"Always" might be subjective here. For example, if there's no desire to
perform live migration, then the user kinda just gets stuck with the cons.
Whether or not we want to make this configurable though is another
discussion.
Correct?
[...]