Jonah Palmer <jonah.pal...@oracle.com> writes:

> On 6/26/25 8:08 AM, Markus Armbruster wrote:

[...]

> Apologies for the delay in getting back to you. I just wanted to be thorough 
> and answer everything as accurately and clearly as possible.
>
> ----
>
> Before these patches, pinning started in vhost_vdpa_dev_start(), where the 
> memory listener was registered; the listener then calls 
> vhost_vdpa_listener_region_add(), which does the actual memory pinning. This 
> happens after entering qemu_main_loop().
>
> After these patches, pinning starts in vhost_dev_init() (specifically in 
> vhost_vdpa_set_owner()), which is where the memory listener registration was 
> moved to. This happens *before* entering qemu_main_loop().
>
> However, not all of the pinning happens before qemu_main_loop(). The pinning 
> that happens before we enter qemu_main_loop() is the full guest RAM pinning, 
> which is the main, heavy-lifting part of the work.
>
> The rest of the pinning work happens after entering qemu_main_loop() 
> (at roughly the same point where pinning started before these patches). But, 
> since the heavy lifting was already done pre qemu_main_loop() (i.e. all pages 
> were already allocated and pinned), we're just re-pinning here: the kernel 
> only updates its IOTLB tables for pages that are already mapped and locked 
> in RAM.
>
> This makes the pinning work we do after entering qemu_main_loop() much faster 
> than the same pinning we had to do before these patches.
>
> However, we have to pay a cost for this. Because we do the heavy lifting 
> earlier, pre qemu_main_loop(), we're pinning cold memory. That is, the guest 
> hasn't touched its memory yet, so all host pages are still anonymous and 
> unallocated. This means that doing the pinning earlier is more expensive 
> time-wise, since we also need to allocate physical pages for each chunk of 
> memory.
>
> To (hopefully) show this more clearly, I ran some tests before and after 
> these patches and averaged the results. I used a 50G guest with real vDPA 
> hardware (Mellanox CX-6Dx):
>
> 0.) How many vhost_vdpa_listener_region_add() calls (i.e. pins)?
>
>                | Total | Before qemu_main_loop | After qemu_main_loop
> _______________|_______|_______________________|_____________________
> Before patches |   6   |           0           |           6
> ---------------|-------|-----------------------|---------------------
> After patches  |  11   |           5           |           6
>
> - After the patches, it looks like we nearly doubled the work (given the 
> extra 5 calls); however, the 6 calls that happen after entering 
> qemu_main_loop() are essentially replays of the first 5 we did.
>
>  * In other words, after the patches, the 6 calls made after entering 
> qemu_main_loop() are performed much faster than the same 6 calls before the 
> patches.
>
>  * From my measurements, these are the timings it took to perform those 6 
> calls after entering qemu_main_loop():
>    > Before patches: 0.0770s
>    > After patches:  0.0065s
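>
> (For reference, here is a rough sketch of how counts and timings like these 
> can be collected -- not my exact tooling. It assumes QEMU was run with 
> something like -trace "vhost_vdpa_listener_region_add" and its trace output 
> captured to vdpa-trace.log, that this trace event exists in your build, and 
> that trace lines carry a "pid@sec.usec:" prefix as with the log trace 
> backend; adjust the parsing to whatever your output actually looks like:)
>
>   #!/usr/bin/env python3
>   # Rough sketch: count vhost_vdpa_listener_region_add trace events in a
>   # captured trace log and, if timestamps are present, report the span
>   # between the first and last call.
>   import re
>   import sys
>
>   EVENT = "vhost_vdpa_listener_region_add"
>   # Assumed line shape: "12345@1699999999.123456:vhost_vdpa_listener_region_add ..."
>   STAMP = re.compile(r'@(\d+\.\d+):' + EVENT + r'\b')
>
>   count = 0
>   timestamps = []
>   with open(sys.argv[1]) as trace:            # e.g. vdpa-trace.log
>       for line in trace:
>           if EVENT in line:
>               count += 1
>               m = STAMP.search(line)
>               if m:
>                   timestamps.append(float(m.group(1)))
>
>   print(f"{EVENT} calls: {count}")
>   if len(timestamps) > 1:
>       span = timestamps[-1] - timestamps[0]
>       print(f"time between first and last call: {span:.4f}s")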
>
> ---
>
> 1.) Time from starting the guest to entering qemu_main_loop():
>  * Before patches: 0.112s
>  * After patches:  3.900s
>
> - This is due to the 5 early pins we're now doing with these patches; before 
> the patches, no pinning work happened at all during this window.
>
> - From measuring the time between the first and last 
> vhost_vdpa_listener_region_add() calls during this period, this comes out to 
> ~3s for the early pinning.

So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop).  Correct?

We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime.  Correct?

[...]

>> Let's see whether I understand...  Please correct my mistakes.
>> 
>> Memory pinning takes several seconds for large guests.
>> 
>> Your patch makes pinning much slower.  You're theorizing this is because
>> pinning cold memory is slower than pinning warm memory.
>> 
>> I suppose the extra time is saved elsewhere, i.e. the entire startup
>> time remains roughly the same.  Have you verified this experimentally?
>
> Based on my measurements, we pay a ~3s increase in initialization time (pre 
> qemu_main_loop()) to handle the heavy lifting of the memory pinning earlier 
> for a vhost-vDPA device. This resulted in:
>
> * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
>
> * Shorter downtime phase during live migration (see below).
>
> * Slight increase in the time it takes for the device to become operational 
> (e.g. guest sets DRIVER_OK).
>   > This measures the time from guest start to the guest setting DRIVER_OK 
> for the device:
>
>     Before patches: 22.46s
>     After patches:  23.40s
>
> The real timesaver here is the guest-visible downtime during live migration 
> (when using a vhost-vDPA device). Since the heavy lifting of the memory 
> pinning is done during the initialization phase, it's no longer part of the 
> stop-and-copy phase, which results in a much shorter guest-visible downtime.
>
> From v5's CV:
>
> Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8 queue-pairs,
> the series reduces guest-visible downtime during back-to-back live
> migrations by more than half:
> - 39G VM:   4.72s -> 2.09s (-2.63s, ~56% improvement)
> - 128G VM:  14.72s -> 5.83s (-8.89s, ~60% improvement)
>
> Essentially, we pay a slightly increased startup-time tax to buy ourselves a 
> much shorter downtime window when we want to perform a live migration with a 
> vhost-vDPA networking device.
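>
> (For completeness: QEMU's own accounting of the downtime can also be read on 
> the source side after migration completes, via QMP's query-migrate command. 
> The guest-visible numbers above may have been measured differently, so the 
> two won't necessarily match. A minimal sketch, with /tmp/qmp.sock as a 
> placeholder for the real QMP socket path:)
>
>   #!/usr/bin/env python3
>   # Rough sketch: print the migration status and the downtime (in ms) that
>   # QEMU reports via query-migrate once a migration has completed.
>   import json
>   import socket
>
>   def qmp_cmd(f, command):
>       f.write((json.dumps({"execute": command}) + "\n").encode())
>       f.flush()
>       while True:
>           reply = json.loads(f.readline())
>           if "return" in reply or "error" in reply:   # skip async events
>               return reply
>
>   with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
>       s.connect("/tmp/qmp.sock")                      # placeholder path
>       f = s.makefile("rwb")
>       f.readline()                                    # QMP greeting
>       qmp_cmd(f, "qmp_capabilities")                  # enter command mode
>       info = qmp_cmd(f, "query-migrate")["return"]
>       print("status:  ", info.get("status"))
>       print("downtime:", info.get("downtime"), "ms")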
>
>> Your stated reason for moving the pinning is moving it from within
>> migration downtime to before migration downtime.  I understand why
>> that's useful.
>> 
>> You mentioned "a small drawback [...] a time in the initialization where
>> QEMU cannot respond to QMP".  Here's what I've been trying to figure out
>> about this drawback since the beginning:
>> 
>> * Under what circumstances is QMP responsiveness affected?  I *guess*
>>   it's only when we start a guest with plenty of memory and a certain
>>   vhost-vdpa configuration.  What configuration exactly?
>> 
>
> Regardless of these patches, as I understand it, QMP cannot actually run any 
> command that requires the BQL while we're pinning memory (memory pinning 
> needs to use the lock).
>
> However, the BQL is not held continuously for the entire pinning process; it 
> is periodically released throughout. But those release windows are *very* 
> short and can only be caught if you're hammering QMP with commands very 
> rapidly.
>
> From a realistic point of view, it's more practical to think of QMP as being 
> fully ready once all pinning has finished, i.e. time_spent_memory_pinning ≈ 
> time_QMP_is_blocked.
>
> ---
>
> As I understand it, QMP is not fully ready and cannot service requests until 
> early on in qemu_main_loop().

It's a fair bit more complicated than that, but it'll do here.

> Given that these patches increase the time it takes to reach qemu_main_loop() 
> (due to the early pinning work), this means that QMP will also be delayed for 
> this time.
>
> I created a test that hammers QMP with commands until one is actually 
> serviced, and recorded how long it took from guest start until the first 
> request was fulfilled (a rough sketch of this kind of probe follows the 
> numbers below):
>  * Before patches: 0.167s
>  * After patches:  4.080s
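>
> (Something along these lines -- a reconstruction, not the exact script; 
> /tmp/qmp.sock stands in for the real QMP socket path:)
>
>   #!/usr/bin/env python3
>   # Rough reconstruction of the probe described above (not the exact test):
>   # keep retrying over QMP until a query-status is actually serviced, then
>   # report how long that took.  Ideally `start` would be taken at the moment
>   # QEMU is launched; /tmp/qmp.sock is a placeholder path.
>   import json
>   import socket
>   import time
>
>   SOCK_PATH = "/tmp/qmp.sock"   # placeholder
>   start = time.monotonic()
>
>   while True:
>       try:
>           with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
>               s.settimeout(0.05)              # give up quickly and retry
>               s.connect(SOCK_PATH)
>               f = s.makefile("rwb")
>               f.readline()                    # greeting (flushed early)
>               for cmd in ("qmp_capabilities", "query-status"):
>                   f.write((json.dumps({"execute": cmd}) + "\n").encode())
>                   f.flush()
>                   while True:                 # wait for the reply,
>                       reply = json.loads(f.readline())
>                       if "return" in reply or "error" in reply:
>                           break               # skipping async events
>               break                           # query-status was serviced
>       except (OSError, json.JSONDecodeError):
>           time.sleep(0.001)                   # hammer again
>
>   print(f"QMP serviced query-status after {time.monotonic() - start:.3f}s")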
>
> This aligns with the time measured to reach qemu_main_loop() and the time 
> we're spending doing the early memory pinning.
>
> All in all, the larger the amount of memory we need to pin, the longer it 
> will take for us to reach qemu_main_loop(), the larger 
> time_spent_memory_pinning will be, and thus the longer it will take for QMP 
> to be ready and fully functional.
>
> ----
>
> I don't believe this is related to any specific vhost-vDPA configuration. I 
> think the bottom line is that if we're using a vhost-vDPA device, we'll be 
> spending more time to reach qemu_main_loop(), so QMP has to wait until we get 
> there.

Let me circle back to my question: Under what circumstances is QMP
responsiveness affected?

The answer seems to be "only when we're using a vhost-vDPA device".
Correct?

We're using one exactly when QEMU is running with one of its
vhost-vdpa-device-pci* device models.  Correct?

>> * How is QMP responsiveness affected?  Delay until the QMP greeting is
>>   sent?  Delay until the first command gets processed?  Delay at some
>>   later time?
>> 
>
> Responsiveness: Longer initial delay due to early pinning work we need to do 
> before we can bring QMP up.
>
> Greeting delay: No greeting delay. Greeting is flushed earlier, even before 
> we start the early pinning work.
>
> * For both before and after the patches, this was ~0.052s for me.
>
> Delay until first command processed: Longer initial delay at startup.
>
> Delay at later time: None.

Got it.

>> * What's the absolute and the relative time of QMP non-responsiveness?
>>   0.2s were mentioned.  I'm looking for something like "when we're not
>>   pinning, it takes 0.8s until the first QMP command is processed, and
>>   when we are, it takes 1.0s".
>> 
>
> The numbers below are based on my recent testing and measurements. This was 
> with a 50G guest with real vDPA hardware.
>
> Before patches:
> ---------------
> * From the start time of the guest to the earliest time QMP is able to 
> process a request (e.g. query-status): 0.167s.
>   > This timing is pretty much the same regardless of whether or not we're 
> pinning memory.
>
> * Time spent pinning memory (QMP cannot handle requests during this window): 
> 0.077s.
>
> After patches:
> --------------
> * From the start time of the guest to the earliest time QMP is able to 
> process a request (e.g. query-status): 4.08s
>   > If we're not early pinning memory, it's ~0.167s.
>
> * Time spent pinning memory *after entering qemu_main_loop()* (QMP cannot 
> handle requests during this window): 0.0065s.

Let me recap:

* No change at all unless we're pinning memory early, and we're doing
  that only when we're using a vhost-vDPA device.  Correct?

* If we are using a vhost-vDPA device:

  - Total startup time (until we're done pinning) increases.

  - QMP becomes available later.

  - Main loop behavior improves: less guest-visible downtime, QMP more
    responsive (once it's available)

  This is a tradeoff we want always.  There is no need to let users pick
  "faster startup, worse main loop behavior."

Correct?

[...]

