On Mon, Feb 05, 2024 at 10:32:33AM +0000, Daniel P. Berrangé wrote:
> On Fri, Feb 02, 2024 at 05:53:39PM +0800, Peter Xu wrote:
> > On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote:
> > > Hello,
> > 
> > Hi, Cédric,
> > 
> > Thanks for the patches.
> > 
> > > 
> > > Today, close_return_path_on_source() can perform a shutdown to exit
> > > the return-path thread if an error occured. However, migrate_fd_cleanup()
> > > does cleanups too early and the shutdown in close_return_path_on_source()
> > > fails, leaving the source and destination waiting for an event to occur.
> > > 
> > > This little series tries to fix that. Comments welcome !  
> > 
> > One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in
> > close_return_path_on_source() is weird: IMHO we have better way to detect
> > "whether the migration has error" now, which is migrate_has_error().
> > 
> > For this specific issue, I think one long standing issue that might be
> > relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share
> > the same QIOChannel now.  Logically the two QEMUFile should be able to be
> > managed separately, say, close() of to_dst_file shouldn't affect the other.
> > 
> > However I don't think it's the case now, as qemu_fclose(to_dst_file) will
> > do qio_channel_close() already, which means there will be a side effect to
> > the other QEMUFile that its backing IOC is already closed.
> > 
> > Is this the issue we're facing?  IOW, the close() of to_dst_file will not
> > properly kick the other thread who is blocked at reading from_dst_file,
> > while the shutdown() will kick it out?
> > 
> > If so, not sure whether we can somehow relay the real qio_channel_close()
> > to until the last user releases it? IOW, conditionally close() the channel
> > in qio_channel_finalize(), if the channel is still open?  Would that make
> > sense?
> 
> IMHO the problem described above is a result of the design mistake of
> having 2 separate QEMUFile instances for what is ultimately the same
> channel. This was a convenient approach to take originally, but it has
> likely outlived its purpose.
> 
> In the ideal world IMHO, QEMUFile would not exist at all, and we would
> have a QIOChannelCached that adds the read/write buffering above the
> base QIOChannel.

We have that in the TODO wiki page for a long time, I'll update it
slightly.

https://wiki.qemu.org/ToDo/LiveMigration#Rewrite_QEMUFile_for_migration

But yeah that might be too big a hammer to solve this specific issue.
AFAIU Fabiano is looking into that direction, but I assume it should still
be a long term thing.

> 
> That's doable, but bigger than a quick fix. A natural stepping stone
> to get there though is to move from 2 QEMUFile objs down to 1 QEMUFile,
> which might be more practical as a quick fix.

Agree. However would this still be quite some change?

We still have a lot of references on the four qemufiles (to/from_dst_file,
to/from_src_file), at least that'll need a replacement; I didn't yet
further check whether all places can be done with a direct replacement of
such change, some tweaks may be needed here and there, but shouldn't be
major.

Meanwhile IIUC it'll also need a major rework on QEMUFile, allowing it to
be bi-directional?  We may need to duplicate the cache layer, IIUC, one for
each direction IOs.

-- 
Peter Xu


Reply via email to