Jonah Palmer <jonah.pal...@oracle.com> writes:

> On 6/26/25 8:08 AM, Markus Armbruster wrote:

[...]

> Apologies for the delay in getting back to you.  I just wanted to be
> thorough and answer everything as accurately and clearly as possible.
>
> ----
>
> Before these patches, pinning started in vhost_vdpa_dev_start(), where
> the memory listener was registered, and began calling
> vhost_vdpa_listener_region_add() to invoke the actual memory pinning.
> This happens after entering qemu_main_loop().
>
> After these patches, pinning starts in vhost_dev_init() (specifically
> vhost_vdpa_set_owner()), which is where the memory listener
> registration was moved to.  This happens *before* entering
> qemu_main_loop().
>
> However, not all of the pinning happens pre qemu_main_loop().  The
> pinning that happens before we enter qemu_main_loop() is the full
> guest RAM pinning, which is the main, heavy-lifting part of the
> pinning work.
>
> The rest of the pinning happens after entering qemu_main_loop()
> (at approximately the same point as where pinning started before
> these patches).  But since we already did the heavy lifting pre
> qemu_main_loop() (i.e. all pages were already allocated and pinned),
> we're just re-pinning here (the kernel just updates its IOTLB tables
> for pages that are already mapped and locked in RAM).
>
> This makes the pinning work we do after entering qemu_main_loop() much
> faster compared to the same pinning we had to do before these patches.
>
> However, we have to pay a cost for this.  Because we do the
> heavy-lifting work earlier, pre qemu_main_loop(), we're pinning cold
> memory.  That is, the guest hasn't touched its memory yet; all host
> pages are still anonymous and unallocated.  This essentially means
> that doing the pinning earlier is more expensive time-wise, given that
> we also need to allocate physical pages for each chunk of memory.
>
> To (hopefully) show this more clearly, I ran some tests before and
> after these patches and averaged the results.  I used a 50G guest with
> real vDPA hardware (Mellanox CX-6Dx):
>
> 0.) How many vhost_vdpa_listener_region_add() calls (pins)?
>
>                   | Total | Before qemu_main_loop | After qemu_main_loop
>   ----------------|-------|-----------------------|---------------------
>    Before patches |   6   |           0           |          6
>    After patches  |  11   |           5           |          6
>
> - After the patches, this looks like we doubled the work we're doing
>   (given the extra 5 calls); however, the 6 calls that happen after
>   entering qemu_main_loop() are essentially replays of the first 5 we
>   did.
>
>   * In other words, after the patches, the 6 calls made after entering
>     qemu_main_loop() are performed much faster than the same 6 calls
>     before the patches.
>
>   * From my measurements, these are the timings it took to perform
>     those 6 calls after entering qemu_main_loop():
>
>       Before patches: 0.0770s
>       After patches:  0.0065s
>
> ---
>
> 1.) Time from starting the guest to entering qemu_main_loop():
>
>     * Before patches: 0.112s
>     * After patches:  3.900s
>
> - This is due to the 5 early pins we're now doing with these patches,
>   whereas before we never did any pinning work at all in this period.
>
> - Measuring the time between the first and last
>   vhost_vdpa_listener_region_add() calls during this period gives ~3s
>   for the early pinning.
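
(Aside: the first-to-last vhost_vdpa_listener_region_add() span quoted
above can be measured with a small trace-parsing script along the
lines of the sketch below.  This is only a rough sketch: it assumes
the events were captured with the "log" trace backend, whose lines
look like "pid@seconds.microseconds:event ...", so adjust the regex to
whatever your trace setup actually emits.)

#!/usr/bin/env python3
# Rough sketch: report the span between the first and last
# vhost_vdpa_listener_region_add trace events in a QEMU trace log
# read from stdin.  Assumes the "log" trace backend's
# "pid@seconds.microseconds:event" line format.
import re
import sys

EVENT = "vhost_vdpa_listener_region_add"
# e.g. "12345@1712345678.123456:vhost_vdpa_listener_region_add ..."
PATTERN = re.compile(r"@(\d+\.\d+):" + re.escape(EVENT) + r"\b")

stamps = [float(m.group(1))
          for line in sys.stdin
          if (m := PATTERN.search(line))]

if len(stamps) < 2:
    sys.exit(f"need at least two {EVENT} events, found {len(stamps)}")

print(f"{len(stamps)} {EVENT} calls, "
      f"first-to-last span: {stamps[-1] - stamps[0]:.4f}s")

(Run QEMU with something like -trace vhost_vdpa_listener_region_add,
redirect stderr to a file, and feed that file to the script on stdin;
the exact -trace invocation depends on how QEMU was built.)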

So, total time increases: early pinning (before main loop) takes more
time than we save pinning (in the main loop).  Correct?

We want this trade, because the time spent in the main loop is a
problem: guest-visible downtime.  Correct?

[...]

>> Let's see whether I understand...  Please correct my mistakes.
>>
>> Memory pinning takes several seconds for large guests.
>>
>> Your patch makes pinning much slower.  You're theorizing this is
>> because pinning cold memory is slower than pinning warm memory.
>>
>> I suppose the extra time is saved elsewhere, i.e. the entire startup
>> time remains roughly the same.  Have you verified this
>> experimentally?
>
> Based on the measurements I did, we pay a ~3s increase in
> initialization time (pre qemu_main_loop()) to handle the heavy lifting
> of the memory pinning earlier for a vhost-vDPA device.  This resulted
> in:
>
> * Faster memory pinning during qemu_main_loop() (0.0770s vs 0.0065s).
>
> * A shorter downtime phase during live migration (see below).
>
> * A slight increase in the time for the device to become operational
>   (e.g. the guest setting DRIVER_OK).
>
>   This measures the time from guest start to the guest setting
>   DRIVER_OK for the device:
>
>     Before patches: 22.46s
>     After patches:  23.40s
>
> The real timesaver here is the guest-visible downtime during live
> migration (when using a vhost-vDPA device).  Since the heavy lifting
> of the memory pinning is done during the initialization phase, it's no
> longer included in the stop-and-copy phase, which results in a much
> shorter guest-visible downtime.
>
> From v5's CV:
>
>   Using ConnectX-6 Dx (MLX5) NICs in vhost-vDPA mode with 8
>   queue-pairs, the series reduces guest-visible downtime during
>   back-to-back live migrations by more than half:
>   - 39G VM:  4.72s -> 2.09s (-2.63s, ~56% improvement)
>   - 128G VM: 14.72s -> 5.83s (-8.89s, ~60% improvement)
>
> Essentially, we pay a slightly increased startup-time tax to buy
> ourselves a much shorter downtime window when we perform a live
> migration with a vhost-vDPA networking device.
>
>> Your stated reason for moving the pinning is moving it from within
>> migration downtime to before migration downtime.  I understand why
>> that's useful.
>>
>> You mentioned "a small drawback [...] a time in the initialization
>> where QEMU cannot respond to QMP".  Here's what I've been trying to
>> figure out about this drawback since the beginning:
>>
>> * Under what circumstances is QMP responsiveness affected?  I *guess*
>>   it's only when we start a guest with plenty of memory and a certain
>>   vhost-vdpa configuration.  What configuration exactly?
>
> Regardless of these patches, as I understand it, QMP cannot actually
> run any command that requires the BQL while we're pinning memory
> (memory pinning needs to use the lock).
>
> However, the BQL is not held for the entirety of the pinning process.
> That is, it's periodically released throughout the pinning process.
> But those windows are *very* short and are only caught if you're
> really hammering QMP with commands very rapidly.
>
> From a realistic point of view, it's more practical to think of QMP
> being fully ready once all pinning has finished, i.e.
> time_spent_memory_pinning ≈ time_QMP_is_blocked.
>
> ---
>
> As I understand it, QMP is not fully ready and cannot service requests
> until early on in qemu_main_loop().

It's a fair bit more complicated than that, but it'll do here.

> Given that these patches increase the time it takes to reach
> qemu_main_loop() (due to the early pinning work), this means that QMP
> will also be delayed for this time.
>
> I created a test that hammers QMP with commands until it's able to
> properly service the request, and recorded how long it took from guest
> start to when it was able to fulfill the request:
>
> * Before patches: 0.167s
> * After patches:  4.080s
>
> This aligns with the time measured to reach qemu_main_loop() and the
> time we're spending on the early memory pinning.
>
> All in all, the larger the amount of memory we need to pin, the longer
> it will take for us to reach qemu_main_loop(), the larger
> time_spent_memory_pinning will be, and thus the longer it will take
> for QMP to be ready and fully functional.
>
> ----
>
> I don't believe this is related to any specific vhost-vDPA
> configuration.  I think the bottom line is that if we're using a
> vhost-vDPA device, we'll be spending more time to reach
> qemu_main_loop(), so QMP has to wait until we get there.

Let me circle back to my question: Under what circumstances is QMP
responsiveness affected?  The answer seems to be "only when we're using
a vhost-vDPA device".  Correct?

We're using one exactly when QEMU is running with one of its
vhost-vdpa-device-pci* device models.  Correct?

>> * How is QMP responsiveness affected?  Delay until the QMP greeting
>>   is sent?  Delay until the first command gets processed?  Delay at
>>   some later time?
>
> Responsiveness: Longer initial delay due to the early pinning work we
> need to do before we can bring QMP up.
>
> Greeting delay: No greeting delay.  The greeting is flushed earlier,
> even before we start the early pinning work.
>
>   * For both before and after the patches, this was ~0.052s for me.
>
> Delay until first command processed: Longer initial delay at startup.
>
> Delay at later time: None.

Got it.

>> * What's the absolute and the relative time of QMP
>>   non-responsiveness?  0.2s were mentioned.  I'm looking for
>>   something like "when we're not pinning, it takes 0.8s until the
>>   first QMP command is processed, and when we are, it takes 1.0s".
>
> The numbers below are based on my recent testing and measurements.
> This was with a 50G guest and real vDPA hardware.
>
> Before patches:
> ---------------
>
> * From the start time of the guest to the earliest time QMP is able
>   to process a request (e.g. query-status): 0.167s.
>
>   This timing is pretty much the same regardless of whether or not
>   we're pinning memory.
>
> * Time spent pinning memory (QMP cannot handle requests during this
>   window): 0.077s.
>
> After patches:
> --------------
>
> * From the start time of the guest to the earliest time QMP is able
>   to process a request (e.g. query-status): 4.08s.
>
>   If we're not early pinning memory, it's ~0.167s.
>
> * Time spent pinning memory *after entering qemu_main_loop()* (QMP
>   cannot handle requests during this window): 0.0065s.

Let me recap:

* No change at all unless we're pinning memory early, and we're doing
  that only when we're using a vhost-vDPA device.  Correct?

* If we are using a vhost-vDPA device:

  - Total startup time (until we're done pinning) increases.

  - QMP becomes available later.

  - Main loop behavior improves: less guest-visible downtime, QMP more
    responsive (once it's available).

This is a tradeoff we want always.  There is no need to let users pick
"faster startup, worse main loop behavior."  Correct?

[...]
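
(For reference, below is a minimal sketch of the kind of QMP "hammer"
probe described above.  It is not the original test: the socket path,
the retry interval, and the assumption that QEMU was started with
something like "-qmp unix:/tmp/qmp.sock,server=on,wait=off" are all
illustrative.)

#!/usr/bin/env python3
# Rough sketch of a QMP "hammer" probe: keep retrying until a QMP
# command is actually serviced, then report how long that took.
import json
import socket
import time

QMP_SOCK = "/tmp/qmp.sock"   # assumed -qmp unix:...,server=on,wait=off

def recv_json(f):
    # QMP emits one JSON object per line.
    line = f.readline()
    if not line:
        raise ConnectionError("QMP connection closed")
    return json.loads(line)

def try_query_status():
    # One attempt: connect, negotiate capabilities, run query-status.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(0.05)                 # fail fast so we can retry
        s.connect(QMP_SOCK)
        f = s.makefile("rw", encoding="utf-8")
        recv_json(f)                       # greeting: {"QMP": {...}}
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        recv_json(f)                       # {"return": {}}
        f.write(json.dumps({"execute": "query-status"}) + "\n")
        f.flush()
        return "return" in recv_json(f)    # command actually serviced?

start = time.monotonic()   # ideally taken when QEMU is launched
while True:
    try:
        if try_query_status():
            break
    except (OSError, ValueError):
        pass               # socket not up, timed out, or partial JSON
    time.sleep(0.001)      # hammer: retry every millisecond

print(f"first QMP command serviced after {time.monotonic() - start:.3f}s")

(Measuring from QEMU's actual launch time, rather than from when the
probe starts, is what a comparison like the 0.167s vs 4.080s above
requires; the sketch only demonstrates the "retry until serviced"
part.)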