On Sun, May 24, 2026 at 09:58:57AM +0300, Avihai Horon wrote:
> 
> On 5/21/2026 6:20 PM, Peter Xu wrote:
> > On Thu, May 21, 2026 at 04:53:54PM +0300, Avihai Horon wrote:
> > > On 5/19/2026 11:09 PM, Peter Xu wrote:
> > > > On Tue, May 05, 2026 at 11:14:09AM +0300, Avihai Horon wrote:
> > > > > Performance tests were done by migrating a single VM with:
> > > > > * 8 GB RAM
> > > > > * 4 mlx5 VFIO devices:
> > > > >     - One device with 1GB of device data (stopcopy data) that runs
> > > > >       workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
> > > > >       (generate new initial_bytes chunks during precopy).
> > > > Could you elaborate a bit more on what workload is executed, and how that
> > > > will affect REINIT reportings (e.g. is only one REINIT generated, or it
> > > > keeps generating)?
> > > Basically, I create and destroy RDMA resources (MRs, QPs, CQs, etc.) on the
> > > VFIO device in a loop for several iterations.
> > > This generates several REINITs.
> > > 
> > > > Can I understand it in this way: without REINIT, device is forced to put
> > > > those data into stopcopy size; then with REINIT, some stopcopy size is
> > > > essentially moved back to precopy phase?
> > > Almost:
> > > Without REINIT, the device is forced to put this data in precopy
> > > dirty_bytes.
> > > With REINIT, this data can be put in precopy init_bytes (and do the
> > > switchover-ack dance again).
> > Hmm, then I don't understand why moving some chunk of data from
> > precopy_bytes to init_bytes helps downtime.
> > 
> > Essentially, QEMU makes the switchover decision based on the math of:
> > 
> >     init+dirty+stop
> >     --------------- <= downtime_limit
> >           bw
> > 
> > The possible min of above is:
> > 
> >          stop
> >     ---------------
> >           bw
> > 
> > Here whether some data would be in init or precopy portion shouldn't matter
> > for a min downtime, since both portions are allowed to be moved during
> > precopy phase.
> > 
> > OTOH, if stop_bytes is unchanged, the min downtime is still the same before /
> > after supporting REINIT, if we try harder.
> > 
> > Say, with below testing results:
> > 
> > With VFIO_PRECOPY_INFO_REINIT:
> >    1335ms total (~520ms from the VFIO device running the workload).
> > 
> > Without VFIO_PRECOPY_INFO_REINIT:
> >    2352ms total (~1600ms from the VFIO device running the workload).
> > 
> > What is the downtime_limit you specified for both cases?  Have you tried to
> > specify lower downtime_limit than what you specified, so that both results
> > will become even closer (until they become, statistically, identical)?
> > 
> > In general, I can understand that REINIT will stop it converging too early,
> > but IIUC the same can be had just by making downtime_limit smaller.  IOW, I
> > may still be missing some important piece of info on how this REINIT feature
> > helps downtime.
> 
> The init_bytes are special in the sense that it's crucial that they are
> transferred before switching over. Otherwise, VFIO precopy may not have full
> effect which could make VFIO migration slower.
> Accordingly, their contribution to downtime may not be just the time it
> takes to transfer them.
> 
> Specifically for mlx5, init_bytes hold a small portion of metadata used for
> time-consuming pre-allocations on the destination side. So, we may have 10MB of
> init_bytes which would take a fraction of a second to transfer, but once they
> reach the destination, it could take even a few seconds to load them.
> 
> When moving this data from dirty_bytes to init_bytes along with
> switchover-ack, we guarantee that this long pre-allocation doesn't happen
> during downtime. This is the time difference you see in the test results.

I see, yes that's what I missed and this makes a lot of sense, thanks.

Could you put the above example into some vfio.rst doc when you repost, as
part of describing the REINIT feature?
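
A toy model along the lines below might help illustrate the point in the doc.
It is only a sketch: the function name, the bandwidth, the sizes and the
1000ms load time are made-up assumptions for illustration, not QEMU code and
not measured mlx5 numbers. It models the worst case where, without REINIT, the
metadata sits in dirty_bytes and is both sent and loaded during downtime.

```python
# Toy model only. All names and numbers are illustrative assumptions,
# not QEMU code and not measured mlx5 values.

def downtime_ms(stop_bytes, pending_precopy_bytes, bw_bytes_per_s,
                dest_load_ms_in_downtime):
    """Downtime = time to transfer what is still pending at switchover,
    plus destination-side load work that did not finish during precopy."""
    transfer_ms = (stop_bytes + pending_precopy_bytes) * 1000 / bw_bytes_per_s
    return transfer_ms + dest_load_ms_in_downtime


BW = 1_000_000_000        # assumed bandwidth: 1 GB/s
STOP = 500_000_000        # assumed stopcopy data: 500 MB
METADATA = 10_000_000     # 10MB of metadata, as in the example above
LOAD_MS = 1000            # assumed destination pre-allocation time

# Without REINIT: metadata is in dirty_bytes, so (worst case) it is sent
# and loaded during downtime.
without_reinit = downtime_ms(STOP, METADATA, BW, LOAD_MS)

# With REINIT: metadata is in init_bytes and switchover-ack guarantees it
# was transferred and loaded before switchover.
with_reinit = downtime_ms(STOP, 0, BW, 0)

print(f"without REINIT: {without_reinit:.0f} ms")
print(f"with REINIT:    {with_reinit:.0f} ms")
```

The transfer cost of the metadata itself is tiny in this model; the gap comes
almost entirely from the load time, which bytes/bw alone does not capture.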

Thanks,

-- 
Peter Xu

