On Sun, May 24, 2026 at 09:58:57AM +0300, Avihai Horon wrote:
>
> On 5/21/2026 6:20 PM, Peter Xu wrote:
> > On Thu, May 21, 2026 at 04:53:54PM +0300, Avihai Horon wrote:
> > > On 5/19/2026 11:09 PM, Peter Xu wrote:
> > > > On Tue, May 05, 2026 at 11:14:09AM +0300, Avihai Horon wrote:
> > > > > Performance tests were done by migrating a single VM with:
> > > > > * 8 GB RAM
> > > > > * 4 mlx5 VFIO devices:
> > > > >   - One device with 1GB of device data (stopcopy data) that runs
> > > > >     workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
> > > > >     (generate new initial_bytes chunks during precopy).
> > > > Could you elaborate a bit more on what workload is executed, and how
> > > > that will affect REINIT reportings (e.g. is only one REINIT generated,
> > > > or it keeps generating)?
> > > Basically, I create and destroy RDMA resources (MRs, QPs, CQs, etc.) on
> > > the VFIO device in a loop for several iterations.
> > > This generates several REINITs.
> > >
> > > > Can I understand it in this way: without REINIT, device is forced to put
> > > > those data into stopcopy size; then with REINIT, some stopcopy size is
> > > > essentially moved back to precopy phase?
> > > Almost:
> > > Without REINIT, the device is forced to put this data in precopy
> > > dirty_bytes.
> > > With REINIT, this data can be put in precopy init_bytes (and do the
> > > switchover-ack dance again).
> > Hmm, then I don't understand why moving some chunk of data from
> > precopy_bytes to init_bytes helps downtime.
> >
> > Essentially, QEMU makes the switchover decision based on the math of:
> >
> >   init+dirty+stop
> >   --------------- <= downtime_limit
> >         bw
> >
> > The possible min of above is:
> >
> >        stop
> >   ---------------
> >         bw
> >
> > Here whether some data would be in init or precopy portion shouldn't matter
> > for a min downtime, since both portions are allowed to be moved during
> > precopy phase.
> >
> > OTOH, if stop_bytes unchanged, min downtime is still the same before /
> > after supporting REINIT, if we try harder.
> >
> > Say, with below testing results:
> >
> >   With VFIO_PRECOPY_INFO_REINIT:
> >   1335ms total (~520ms from the VFIO device running the workload).
> >
> >   Without VFIO_PRECOPY_INFO_REINIT:
> >   2352ms total (~1600ms from the VFIO device running the workload).
> >
> > What is the downtime_limit you specified for both cases? Have you tried to
> > specify lower downtime_limit than what you specified, so that both results
> > will become even closer (until they become, statistically, identical)?
> >
> > In general, I can understand the REINIT will stop converging too early, but
> > it'll be the same IIUC just to turn the downtime_limit smaller.. IOW, I
> > may still miss some important piece of info that how this REINIT feature
> > helps downtime..
>
> The init_bytes are special in the sense that it's crucial that they are
> transferred before switching over. Otherwise, VFIO precopy may not have full
> effect which could make VFIO migration slower.
> Accordingly, their contribution to downtime may not be just the time it
> takes to transfer them.
>
> Specifically for mlx5, init_bytes hold a small portion of metadata used for
> time consuming pre-allocations on destination side. So, we may have 10MB of
> init_bytes which would take a fraction of a second to transfer, but once
> reached destination, it could take even a few seconds to load them.
>
> When moving this data from dirty_bytes to init_bytes along with
> switchover-ack, we guarantee that this long pre-allocation doesn't happen
> during downtime. This is the time difference you see in the test results.

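To make the point above concrete, here is a toy model (not QEMU code) of why
a transfer-only estimate misses the destination load cost. The function name,
the bandwidth, the stop_bytes size and the 2000 ms load time are all
illustrative assumptions; only the 10MB metadata size comes from the example
above.

```python
MB = 1024 * 1024


def estimated_downtime_ms(pending_bytes, bw_bytes_per_ms, load_ms=0.0):
    """Time to transfer what is still pending at switchover, plus any
    destination-side load work that has to run while the VM is stopped."""
    return pending_bytes / bw_bytes_per_ms + load_ms


# Hypothetical numbers, chosen only for illustration.
bw = 1 * MB                # 1 MB per ms, roughly 1 GB/s
stop_bytes = 100 * MB      # data that can only be sent after the VM stops
meta_bytes = 10 * MB       # metadata chunk, per the 10MB example above
meta_load_ms = 2000.0      # assumed pre-allocation time on the destination

# Transfer time of the metadata itself is small either way.
meta_transfer_ms = meta_bytes / bw

# Without REINIT: the metadata travels as dirty_bytes, so nothing guarantees
# the destination finished loading it before switchover. Worst case, the
# pre-allocation runs inside the downtime window.
without_reinit = estimated_downtime_ms(stop_bytes, bw, load_ms=meta_load_ms)

# With REINIT: the metadata travels as init_bytes and switchover-ack is
# awaited again, so the pre-allocation completes before the VM stops.
with_reinit = estimated_downtime_ms(stop_bytes, bw)

print(f"metadata transfer: {meta_transfer_ms:.0f} ms")
print(f"without REINIT (worst case): {without_reinit:.0f} ms")
print(f"with REINIT: {with_reinit:.0f} ms")
```

In this model stop_bytes/bw is identical in both cases, which matches the
formula quoted above; the gap comes entirely from where the load time lands.
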
I see, yes, that's what I missed, and this makes a lot of sense, thanks.

Could you put the above example into some vfio.rst doc when you repost, in
the part describing the REINIT feature?

Thanks,

-- 
Peter Xu
