On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
> On 29.04.2024 17:09, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
> > > On 24.04.2024 00:35, Peter Xu wrote:
> > > > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 24.04.2024 00:20, Peter Xu wrote:
> > > > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > I think one of the reasons for these results is that
> > > > > > > > > > > mixed (RAM + device state) multifd channels participate
> > > > > > > > > > > in the RAM sync process (MULTIFD_FLAG_SYNC) whereas
> > > > > > > > > > > device state dedicated channels don't.
> > > > > > > > > > 
> > > > > > > > > > Firstly, I'm wondering whether we can have better names
> > > > > > > > > > for these new hooks.  Currently (only commenting on the
> > > > > > > > > > async* stuff):
> > > > > > > > > >
> > > > > > > > > >       - complete_precopy_async
> > > > > > > > > >       - complete_precopy
> > > > > > > > > >       - complete_precopy_async_wait
> > > > > > > > > >
> > > > > > > > > > But perhaps better:
> > > > > > > > > >
> > > > > > > > > >       - complete_precopy_begin
> > > > > > > > > >       - complete_precopy
> > > > > > > > > >       - complete_precopy_end
> > > > > > > > > >
> > > > > > > > > > ?
> > > > > > > > > >
> > > > > > > > > > I don't see why the device must do something async in such
> > > > > > > > > > a hook.  To me it's more like you're splitting one process
> > > > > > > > > > into multiple, so begin/end sounds more generic.
> > > > > > > > > >
> > > > > > > > > > Then, with that in mind, IIUC we can already split
> > > > > > > > > > ram_save_complete() into more than one phase too.  For
> > > > > > > > > > example, I would be curious whether the performance will go
> > > > > > > > > > back to normal if we offload multifd_send_sync_main() into
> > > > > > > > > > complete_precopy_end(), because we really only need one
> > > > > > > > > > shot of that, and I am quite surprised it already greatly
> > > > > > > > > > affects VFIO dumping its own things.
> > > > > > > > > >
> > > > > > > > > > I would even go one step further, as Dan was asking: have
> > > > > > > > > > you thought about dumping VFIO states via multifd even
> > > > > > > > > > during iterations?  Would that help even more than this
> > > > > > > > > > series (which IIUC only helps during the blackout phase)?
> > > > > > > > > 
> > > > > > > > > To dump during RAM iteration, the VFIO device will need to
> > > > > > > > > have dirty tracking and iterate on its state, because the
> > > > > > > > > guest CPUs will still be running, potentially changing VFIO
> > > > > > > > > state.  That seems impractical in the general case.
> > > > > > > > 
> > > > > > > > We already do such iterations in vfio_save_iterate()?
> > > > > > > >
> > > > > > > > My understanding is the recent VFIO work is based on the fact that
> > > > > > > > the VFIO device can track device state changes more or less (besides
> > > > > > > > being able to save/load full states).  E.g. I still remember in our
> > > > > > > > QE tests some old devices reported many more dirty pages than
> > > > > > > > expected during the iterations, when we were looking into an issue
> > > > > > > > where a huge amount of dirty pages was reported.  But newer models
> > > > > > > > seem to have fixed that and report much less.
> > > > > > > >
> > > > > > > > That issue was about GPUs not NICs, though, and IIUC a major portion
> > > > > > > > of such tracking used to be for GPU vRAMs.  So maybe I was mixing
> > > > > > > > these up, and maybe they work differently.
> > > > > > > 
> > > > > > > The device which this series was developed against (Mellanox
> > > > > > > ConnectX-7) is already transferring its live state before the VM gets
> > > > > > > stopped (via the save_live_iterate SaveVMHandler).
> > > > > > >
> > > > > > > It's just that in addition to the live state it has more than 400 MiB
> > > > > > > of state that cannot be transferred while the VM is still running.
> > > > > > > And that fact hurts a lot with respect to the migration downtime.
> > > > > > >
> > > > > > > AFAIK it's a very similar story for (some) GPUs.
> > > > > > 
> > > > > > So during the iteration phase VFIO cannot yet leverage the multifd
> > > > > > channels with this series, am I right?
> > > > > 
> > > > > That's right.
> > > > > 
> > > > > > Is it possible to extend that use case too?
> > > > > 
> > > > > I guess so, but since this phase (iteration while the VM is still
> > > > > running) doesn't impact downtime, it is much less critical.
> > > > 
> > > > But it affects the bandwidth: e.g. even with multifd enabled, the device
> > > > iteration data will still bottleneck at ~15 Gbps on a common system setup
> > > > in the best case, even if the hosts are 100 Gbps direct-connected.  Would
> > > > that be a concern in the future too, or is it a known problem that won't
> > > > be fixed anyway?
> > > 
> > > I think any improvements to the migration performance are good, even if
> > > they don't impact downtime.
> > > 
> > > It's just that this patch set focuses on the downtime phase as the more
> > > critical thing.
> > > 
> > > After this gets improved there's no reason not to look at improving the
> > > performance of the VM live phase too, if it brings sensible improvements.
> > > 
> > > > I remember Avihai used to have a plan to look into similar issues; I hope
> > > > this is exactly what he is looking for.  Otherwise changing the migration
> > > > protocol from time to time is cumbersome; we always need to provide a flag
> > > > to make sure old systems migrate in the old ways and new systems run the
> > > > new ways, and for such a relatively major change I'd want to double check
> > > > how far we are from supporting offload of VFIO iteration data to multifd.
> > > 
> > > The device state transfer is indicated by a new flag in the multifd
> > > header (MULTIFD_FLAG_DEVICE_STATE).
> > > 
> > > If we are to use multifd channels for VM live phase transfers, these
> > > could simply re-use the same flag type.
> > 
> > Right, and that's also the major purpose of my request to consider both
> > issues.
> > 
> > If supporting iteration can be easy on top of this, I am wondering whether
> > we should do this in one shot.  The problem is that even if the flag type
> > can be reused, old/new QEMU binaries may not be compatible and may not
> > migrate well when:
> > 
> >    - The old qemu only supports the downtime optimizations
> >    - The new qemu supports both downtime + iteration optimizations
> 
> I think the situation here will be the same as with any new flag
> affecting the migration wire protocol - if the old version of QEMU
> doesn't support that flag then it has to be kept at its backward-compatible
> setting for migration to succeed.
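
Right, and just to spell out what I mean by keeping it compatible, here is a
rough sketch (the option and function names below are invented for
illustration; this is not code from the series or from current QEMU): as long
as the knob defaults to off and management only flips it when both sides
understand the new packets, an old QEMU never sees anything new on the wire.

/*
 * Sketch only: option and function names are invented for illustration,
 * this is not code from the series.
 */
#include <stdbool.h>
#include <stdio.h>

static bool multifd_device_state;   /* hypothetical knob, off by default */

static void send_device_state_chunk(const void *buf, size_t len)
{
    (void)buf;
    if (multifd_device_state) {
        /* New path: mark the payload with a device-state packet flag. */
        printf("send %zu bytes over a multifd channel\n", len);
    } else {
        /* Backward-compatible path: keep using the main migration channel. */
        printf("send %zu bytes over the main channel\n", len);
    }
}

int main(void)
{
    char buf[64] = { 0 };

    /* Old QEMU, or knob left at its default: nothing changes on the wire. */
    send_device_state_chunk(buf, sizeof(buf));

    /* Both sides new and the user opted in: the new path kicks in. */
    multifd_device_state = true;
    send_device_state_chunk(buf, sizeof(buf));
    return 0;
}
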
> 
> > IIUC, at least the device threads are currently created only at the end of
> > migration when switching over, for the downtime-only optimization (aka, this
> > series).  That means it won't be compatible with a new QEMU, as the threads
> > there will need to be created before iteration starts in order to take
> > iteration data.  So I believe we'll need yet another flag to tune that
> > behavior, one for each optimization (downtime vs. data during iterations).
> > If they work mostly similarly, I want to avoid two flags.  It'll be chaos
> > for users to see such similar flags and they'll be pretty confusing.
> 
> The VFIO loading threads are created from vfio_load_setup(), which is
> called at the very beginning of the migration, so they should already be
> there.
> 
> However, they aren't currently prepared to receive VM live phase data.
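
Ah, I see.  For my own understanding, I picture the receive side roughly like
the sketch below (all structure and names here are mine for illustration, not
code from the series): the thread exists from setup time and just consumes
whatever gets queued, so today only switchover-phase buffers would ever reach
it, and iteration-phase support would mostly mean also queueing buffers from
the multifd receive path.

/* Rough sketch only: structure and names here are illustrative. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Buf {
    char *data;
    struct Buf *next;
} Buf;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static Buf *queue;
static bool done;

/*
 * The load thread exists from setup time and just consumes whatever gets
 * queued.  In a downtime-only scheme that would be switchover-phase buffers;
 * iteration-phase support would mostly mean also queueing buffers from the
 * multifd receive path.
 */
static void *load_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        while (queue) {
            Buf *b = queue;
            queue = b->next;
            pthread_mutex_unlock(&lock);
            printf("loading: %s\n", b->data);   /* feed the device here */
            free(b->data);
            free(b);
            pthread_mutex_lock(&lock);
        }
        if (done) {
            break;
        }
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void queue_buf(const char *s)
{
    Buf *b = malloc(sizeof(*b));

    b->data = strdup(s);
    pthread_mutex_lock(&lock);
    b->next = queue;
    queue = b;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t th;

    pthread_create(&th, NULL, load_thread, NULL);   /* "load setup" time */
    queue_buf("switchover-phase chunk");            /* what happens today */
    queue_buf("iteration-phase chunk");             /* the part to be added */

    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(th, NULL);
    return 0;
}
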
> 
> > If possible, I wish we could spend some time looking into that if they're so
> > close, and if it's low-hanging fruit on top of this series, maybe we can
> > consider doing that in one shot.
> 
> I'm still trying to figure out the complete explanation of why dedicated
> device state channels improve downtime, as there were a bunch of holidays
> here last week.

No rush.  I am not sure whether it'll reduce downtime, but it may improve
total migration time when multiple devices are used.

> 
> I will have a look later at what it would take to add VM live phase multifd
> device state transfer support and also how invasive it would be, as I think
> it's better to keep the number of code conflicts in a patch set to a
> manageable size since that reduces the chance of accidentally introducing
> regressions when forward-porting the patch set to git master.

Yes, it makes sense.  It'll be good to look one step further in this case,
then:

  (1) If it's easy to add support, then we do it in one batch, or

  (2) If it's not easy to add support, but we can find a compatible way so
      that the ABI can stay transparent when adding it later, that'll also
      be nice, or

  (3) If we have a solid clue that it should be a major separate work, and
      we definitely need a new flag, then we at least know we should simply
      split the effort due to that complexity.

The hope is option (1)/(2) would work out.
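
To make option (2) a bit more concrete, what I have in mind is roughly the
snippet below.  The flag names mimic the existing multifd ones, but the
values and helpers are made up for this sketch; it is not code from the
series.  If the receiver dispatches purely on the per-packet flag, it does
not need to know whether the payload was produced during iteration or during
switchover, so iteration-phase transfers could later reuse the same
wire-level flag.

/*
 * Illustrative only: the flag names mimic the existing multifd ones, but
 * the values and helpers here are made up for this sketch.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define FLAG_SYNC          (1u << 0)   /* RAM sync point */
#define FLAG_DEVICE_STATE  (1u << 1)   /* payload carries device state */

typedef struct PacketHeader {
    uint32_t flags;
    uint32_t size;
} PacketHeader;

static void handle_ram(const uint8_t *p, uint32_t len, int sync)
{
    (void)p;
    printf("RAM packet, %" PRIu32 " bytes, sync=%d\n", len, sync);
}

static void handle_device_state(const uint8_t *p, uint32_t len)
{
    (void)p;
    printf("device state packet, %" PRIu32 " bytes\n", len);
}

/*
 * The receiver dispatches purely on the per-packet flag; it does not care
 * which migration phase produced the payload, so the same bit could cover
 * both switchover-time and (later) iteration-time device state.
 */
static void dispatch(const PacketHeader *hdr, const uint8_t *payload)
{
    if (hdr->flags & FLAG_DEVICE_STATE) {
        handle_device_state(payload, hdr->size);
    } else {
        handle_ram(payload, hdr->size, !!(hdr->flags & FLAG_SYNC));
    }
}

int main(void)
{
    uint8_t payload[16] = { 0 };
    PacketHeader dev = { .flags = FLAG_DEVICE_STATE, .size = sizeof(payload) };
    PacketHeader ram = { .flags = FLAG_SYNC, .size = sizeof(payload) };

    dispatch(&dev, payload);
    dispatch(&ram, payload);
    return 0;
}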

I hope Avihai can also chime in here (or please reach out to him) because I
remember he used to consider proposing such a whole solution, but maybe I
just misunderstood.  I suppose there's no harm in checking with him.

Thanks,

-- 
Peter Xu

