On 7/4/25 11:00 AM, Markus Armbruster wrote:
Jonah Palmer <jonah.pal...@oracle.com> writes:
On 6/26/25 8:08 AM, Markus Armbruster wrote:
[...]
Apologies for the delay in getting back to you. I just wanted to be thorough
and answer everything as accurately and clearly as possible.
----
Before these patches, pinning started in vhost_vdpa_dev_start(), where the
memory listener was registered; registering the listener is what triggers the
vhost_vdpa_listener_region_add() calls that do the actual memory pinning. This
happens after entering qemu_main_loop().
After these patches, pinning starts in vhost_dev_init() (specifically
vhost_vdpa_set_owner()), which is where the memory listener registration was
moved to. This happens *before* entering qemu_main_loop().
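To make the relocation a bit more concrete, here's a tiny stand-alone toy model
of the idea (this is NOT QEMU code; every name below is a stand-in made up for
illustration). The point is just that registering the listener is what replays
the existing RAM sections through the region_add callback, so moving the
registration earlier moves the pinning earlier too:

/* Toy model of the relocation described above -- not QEMU code; all
 * names are stand-ins invented for illustration. In the real code the
 * listener's region_add callback is what ends up pinning each section. */
#include <stdio.h>

typedef struct {
    /* Invoked once per existing RAM section when the listener is
     * registered (and again for any section added later). */
    void (*region_add)(const char *section);
} ToyMemoryListener;

static void toy_region_add(const char *section)
{
    printf("pinning %s\n", section);  /* stand-in for the DMA map + pin */
}

static ToyMemoryListener toy_listener = { .region_add = toy_region_add };

static void toy_listener_register(ToyMemoryListener *l)
{
    /* Registration immediately replays the current RAM sections. */
    l->region_add("guest RAM");
}

static void toy_device_init(void)
{
    /* After the series: register during device init/realize, so the
     * full guest RAM pinning happens before the main loop is entered. */
    toy_listener_register(&toy_listener);
}

static void toy_main_loop(void)
{
    /* Before the series: registration happened at device start, i.e.
     * from inside the main loop, so all of the pinning landed here. */
    printf("main loop running\n");
}

int main(void)
{
    toy_device_init();
    toy_main_loop();
    return 0;
}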
However, not all of the pinning happens before qemu_main_loop(). The pinning
that happens before we enter qemu_main_loop() is the full guest RAM pinning,
which is the main, heavy-lifting part of the work.
The rest of the pinning happens after entering qemu_main_loop() (at roughly the
same point where pinning started before these patches). But since the heavy
lifting was already done before qemu_main_loop() (i.e. all pages were already
allocated and pinned), we're just re-pinning here (the kernel just updates its
IOTLB tables for pages that are already mapped and locked in RAM).
This makes the pinning work we do after entering qemu_main_loop() much faster
than the same pinning was before these patches.
However, we have to pay a cost for this. Because we do the heavy lifting
earlier, before qemu_main_loop(), we're pinning cold memory. That is, the guest
hasn't touched its memory yet; all host pages are still anonymous and
unallocated. This means that doing the pinning earlier is more expensive
time-wise, since we also need to allocate physical pages for each chunk of
memory.
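If it helps to see the cold-memory cost outside of QEMU, here's a small
stand-alone sketch (not QEMU code; sizes and numbers are just for
demonstration): it mlock()s an untouched anonymous mapping, unlocks it, and
locks it again. The first lock has to allocate and fault in every page; the
second one finds the pages already resident and is mostly bookkeeping, which is
the same effect that makes the later re-pinning cheap:

/* Cold vs. warm pinning illustration -- stand-alone sketch, not QEMU code.
 * Build: gcc -O2 -o pin-demo pin-demo.c
 * Note: locking 1 GiB needs a raised RLIMIT_MEMLOCK (ulimit -l). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define SIZE (1UL << 30)   /* 1 GiB, purely for demonstration */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    void *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* "Cold" pin: pages are still anonymous and unallocated, so the
     * kernel has to allocate and fault in every page while pinning. */
    double t0 = now_sec();
    if (mlock(buf, SIZE) != 0) {
        perror("mlock (cold)");
        return 1;
    }
    printf("cold pin: %.3fs\n", now_sec() - t0);

    munlock(buf, SIZE);
    memset(buf, 0x5a, SIZE);   /* the "guest" touches its memory */

    /* "Re-pin": the pages already exist and stay resident, so pinning
     * again is mostly bookkeeping. */
    t0 = now_sec();
    if (mlock(buf, SIZE) != 0) {
        perror("mlock (warm)");
        return 1;
    }
    printf("re-pin:   %.3fs\n", now_sec() - t0);

    munmap(buf, SIZE);
    return 0;
}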
To (hopefully) show this more clearly, I ran some tests before and after these
patches and averaged the results. I used a 50G guest with real vDPA hardware
(Mellanox CX-6Dx):
0.) How many vhost_vdpa_listener_region_add() (pins) calls?
               | Total | Before qemu_main_loop | After qemu_main_loop
---------------+-------+-----------------------+---------------------
Before patches |   6   |           0           |          6
After patches  |  11   |           5           |          6
- After the patches, it looks like we doubled the work we're doing (given the
extra 5 calls). However, the 6 calls that happen after entering
qemu_main_loop() are essentially replays of the first 5 we did.
* In other words, after the patches, the 6 calls made after entering
qemu_main_loop() are performed much faster than the same 6 calls before the
patches.
* From my measurements, these are the timings it took to perform those 6
calls after entering qemu_main_loop():
> Before patches: 0.0770s
> After patches: 0.0065s
---
1.) Time from starting the guest to entering qemu_main_loop():
* Before patches: 0.112s
* After patches: 3.900s
- This is due to the 5 early pins we're doing now with these patches, whereas
before no pinning work happened at all during this phase.
- From measuring the time between the first and last
vhost_vdpa_listener_region_add() calls during this period, this comes out to
~3s for the early pinning.
So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop). Correct?
Correct. We only save ~0.07s from the pinning that happens in the main
loop. But the extra 3s we now need to spend pinning before
qemu_main_loop() overshadows it.
We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime. Correct?
[...]
Correct. Though whether or not we want this trade I suppose is
subjective. But the 50-60% reduction in guest-visible downtime is pretty
nice if we can stomach the initial startup costs.
Let's see whether I understand... Please correct my mistakes.
Memory pinning takes several seconds for large guests.
Your patch makes pinning much slower. You're theorizing this is because
pinning cold memory is slower than pinning warm memory.
I suppose the extra time is saved elsewhere, i.e. the entire startup
time remains roughly the same. Have you verified this experimentally?
Based on my measurements, we pay a ~3s increase in initialization time (before
qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier for
a vhost-vDPA device. This resulted in:
* Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
* Shorter downtime phase during live migration (see below).
* Slight increase in time for the device to be operational (e.g. guest sets
DRIVER_OK).
> This measures the time from guest start to the guest setting DRIVER_OK for
the device:
Before patches: 22.46s
After patches: 23.40s
The real timesaver here is the guest-visible downtime during live migration
(when using a vhost-vDPA device). Since the heavy lifting of the memory pinning
is done during the initialization phase, it's no longer included in the
stop-and-copy phase, which results in a much shorter guest-visible downtime.
From v5's cover letter:
Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
the series reduces guest-visible downtime during back-to-back live
migrations by more than half:
- 39G VM: 4.72s -> 2.09s (-2.63s, ~56% improvement)
- 128G VM: 14.72s -> 5.83s (-8.89s, ~60% improvement)
Essentially, we pay a slightly increased startup-time tax to buy ourselves a
much shorter downtime window when we perform a live migration with a vhost-vDPA
networking device.
Your stated reason for moving the pinning is moving it from within
migration downtime to before migration downtime. I understand why
that's useful.
You mentioned "a small drawback [...] a time in the initialization where
QEMU cannot respond to QMP". Here's what I've been trying to figure out
about this drawback since the beginning:
* Under what circumstances is QMP responsiveness affected? I *guess*
it's only when we start a guest with plenty of memory and a certain
vhost-vdpa configuration. What configuration exactly?
Regardless of these patches, as I understand it, QMP cannot actually run any
command that requires the BQL while we're pinning memory (memory pinning needs
to use the lock).
However, the BQL is not held for the entirety of the pinning process; it's
periodically released throughout. But those windows are *very* short and are
only caught if you're hammering QMP with commands very rapidly.
From a realistic point of view, it's more practical to think of QMP as being
fully ready only once all pinning has finished, i.e. time_spent_memory_pinning
≈ time_QMP_is_blocked.
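As a rough stand-alone model of why so few commands squeeze through those
windows (just a sketch; QEMU's real BQL and pinning code are more involved than
a bare pthread mutex): one thread holds a lock for long chunks of "pinning" and
drops it only briefly between chunks, while a second thread stands in for QMP
dispatch and needs that lock to service each command:

/* Toy model of "lock mostly held, briefly released" -- not QEMU code.
 * Build: gcc -O2 -pthread -o bql-demo bql-demo.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the BQL */
static bool pinning_done = false;
static int commands_serviced = 0;

/* Stand-in for the pinning path: hold the lock for a long chunk of
 * work, release it only for a tiny gap, then grab it again. */
static void *pinner(void *arg)
{
    for (int chunk = 0; chunk < 10; chunk++) {
        pthread_mutex_lock(&big_lock);
        usleep(300 * 1000);              /* 300ms of "pinning" per chunk */
        pthread_mutex_unlock(&big_lock); /* lock is free only for this gap */
    }
    pthread_mutex_lock(&big_lock);
    pinning_done = true;
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Stand-in for QMP dispatch: every command needs the lock.  (The real
 * dispatcher keeps running forever; this one exits so the demo ends.) */
static void *qmp_dispatcher(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&big_lock);
        bool done = pinning_done;
        if (!done) {
            commands_serviced++;         /* a command squeezed into a gap */
        }
        pthread_mutex_unlock(&big_lock);
        if (done) {
            return NULL;
        }
        usleep(1000);                    /* client hammering every ~1ms */
    }
}

int main(void)
{
    pthread_t p, q;
    pthread_create(&p, NULL, pinner, NULL);
    pthread_create(&q, NULL, qmp_dispatcher, NULL);
    pthread_join(p, NULL);
    pthread_join(q, NULL);
    printf("commands serviced during ~3s of 'pinning': %d\n",
           commands_serviced);
    return 0;
}

Hammering every ~1ms for ~3 seconds would get thousands of commands through an
idle lock; here only a handful make it, which is why, practically speaking, QMP
looks blocked for the whole pinning window.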
---
As I understand it, QMP is not fully ready and cannot service requests until
early on in qemu_main_loop().
It's a fair bit more complicated than that, but it'll do here.
Given that these patches increase the time it takes to reach qemu_main_loop()
(due to the early pinning work), this means that QMP will also be delayed by
that amount.
I created a test that hammers QMP with commands until it's able to properly
service the request and recorded how long it took from guest start to when it
was able to fulfill the request:
* Before patches: 0.167s
* After patches: 4.080s
This aligns with the time measured to reach qemu_main_loop() and the time we
spend doing the early memory pinning.
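For reference, a minimal poller along those lines could look like the sketch
below (not the exact test used; the socket path, the choice of query-status,
and the crude string matching on responses are assumptions for illustration).
It assumes QEMU was started alongside the probe with something like
-qmp unix:/tmp/qmp.sock,server=on,wait=off, connects as soon as the socket
accepts, negotiates capabilities, and reports how long it took until
query-status was actually answered:

/* Minimal QMP latency probe -- illustration only, not the exact test.
 * Assumes: -qmp unix:/tmp/qmp.sock,server=on,wait=off on the QEMU side. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define QMP_SOCK "/tmp/qmp.sock"   /* assumption: matches the -qmp option */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void send_str(int fd, const char *s)
{
    if (write(fd, s, strlen(s)) < 0) {
        perror("write");
    }
}

/* Keep reading until the accumulated data contains the given marker. */
static int wait_for(int fd, const char *marker)
{
    char buf[65536];
    size_t used = 0;

    while (used < sizeof(buf) - 1) {
        ssize_t n = read(fd, buf + used, sizeof(buf) - 1 - used);
        if (n <= 0) {
            return -1;
        }
        used += n;
        buf[used] = '\0';
        if (strstr(buf, marker)) {
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, QMP_SOCK, sizeof(addr.sun_path) - 1);

    double start = now_sec();   /* probe launched together with QEMU */
    int fd;

    /* Retry until the QMP socket accepts connections. */
    for (;;) {
        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            break;
        }
        close(fd);
        usleep(1000);
    }

    /* The greeting comes first; capabilities negotiation is required
     * before QMP will accept any other command. */
    wait_for(fd, "\"QMP\"");
    send_str(fd, "{\"execute\": \"qmp_capabilities\"}");
    wait_for(fd, "\"return\"");

    /* Time how long until query-status actually gets answered. */
    send_str(fd, "{\"execute\": \"query-status\"}");
    wait_for(fd, "\"status\"");
    printf("query-status answered %.3fs after probe start\n",
           now_sec() - start);

    close(fd);
    return 0;
}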
All in all, the larger the amount of memory we need to pin, the longer it will
take for us to reach qemu_main_loop(), the larger time_spent_memory_pinning
will be, and thus the longer it will take for QMP to be ready and fully
functional.
----
I don't believe this is related to any specific vhost-vDPA configuration. I
think the bottom line is that if we're using a vhost-vDPA device, we'll spend
more time reaching qemu_main_loop(), so QMP has to wait until we get there.
Let me circle back to my question: Under what circumstances is QMP
responsiveness affected?
The answer seems to be "only when we're using a vhost-vDPA device".
Correct?
Correct, since using one of these devices causes us to do this memory
pinning. If we're not using one, it's business as usual for QEMU.
We're using one exactly when QEMU is running with one of its
vhost-vdpa-device-pci* device models. Correct?
Yeah, or something like:
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,... \
-device virtio-net-pci,netdev=vhost-vdpa0,id=vdpa0,... \
* How is QMP responsiveness affected? Delay until the QMP greeting is
sent? Delay until the first command gets processed? Delay at some
later time?
Responsiveness: Longer initial delay due to early pinning work we need to do
before we can bring QMP up.
Greeting delay: No greeting delay. Greeting is flushed earlier, even before we
start the early pinning work.
* For both before and after the patches, this was ~0.052s for me.
Delay until first command processed: Longer initial delay at startup.
Delay at later time: None.
Got it.
* What's the absolute and the relative time of QMP non-responsiveness?
0.2s were mentioned. I'm looking for something like "when we're not
pinning, it takes 0.8s until the first QMP command is processed, and
when we are, it takes 1.0s".
The numbers below are based on my recent testing and measurements. This was
with a 50G guest with real vDPA hardware.
Before patches:
---------------
* From the start time of the guest to the earliest time QMP is able to process
a request (e.g. query-status): 0.167s.
> This timing is pretty much the same regardless of whether or not we're
pinning memory.
* Time spent pinning memory (QMP cannot handle requests during this window):
0.077s.
After patches:
--------------
* From the start time of the guest to the earliest time QMP is able to process
a request (e.g. query-status): 4.08s
> If we're not early pinning memory, it's ~0.167s.
* Time spent pinning memory *after entering qemu_main_loop()* (QMP cannot
handle requests during this window): 0.0065s.
Let me recap:
* No change at all unless we're pinning memory early, and we're doing
that only when we're using a vhost-vDPA device. Correct?
* If we are using a vhost-vDPA device:
- Total startup time (until we're done pinning) increases.
Correct.
- QMP becomes available later.
Correct.
- Main loop behavior improves: less guest-visible downtime, QMP more
responsive (once it's available)
Correct. Though the improvement is modest at best if we put aside the
guest-visible downtime improvement.
This is a tradeoff we want always. There is no need to let users pick
"faster startup, worse main loop behavior."
"Always" might be subjective here. For example, if there's no desire to
perform live migration, then the user kinda just gets stuck with the cons.
Whether or not we want to make this configurable though is another
discussion.
Correct?
[...]