On Mon, Feb 05, 2024 at 10:32:33AM +0000, Daniel P. Berrangé wrote: > On Fri, Feb 02, 2024 at 05:53:39PM +0800, Peter Xu wrote: > > On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote: > > > Hello, > > > > Hi, Cédric, > > > > Thanks for the patches. > > > > > > > > Today, close_return_path_on_source() can perform a shutdown to exit > > > the return-path thread if an error occured. However, migrate_fd_cleanup() > > > does cleanups too early and the shutdown in close_return_path_on_source() > > > fails, leaving the source and destination waiting for an event to occur. > > > > > > This little series tries to fix that. Comments welcome ! > > > > One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in > > close_return_path_on_source() is weird: IMHO we have better way to detect > > "whether the migration has error" now, which is migrate_has_error(). > > > > For this specific issue, I think one long standing issue that might be > > relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share > > the same QIOChannel now. Logically the two QEMUFile should be able to be > > managed separately, say, close() of to_dst_file shouldn't affect the other. > > > > However I don't think it's the case now, as qemu_fclose(to_dst_file) will > > do qio_channel_close() already, which means there will be a side effect to > > the other QEMUFile that its backing IOC is already closed. > > > > Is this the issue we're facing? IOW, the close() of to_dst_file will not > > properly kick the other thread who is blocked at reading from_dst_file, > > while the shutdown() will kick it out? > > > > If so, not sure whether we can somehow relay the real qio_channel_close() > > to until the last user releases it? IOW, conditionally close() the channel > > in qio_channel_finalize(), if the channel is still open? Would that make > > sense? > > IMHO the problem described above is a result of the design mistake of > having 2 separate QEMUFile instances for what is ultimately the same > channel. This was a convenient approach to take originally, but it has > likely outlived its purpose. > > In the ideal world IMHO, QEMUFile would not exist at all, and we would > have a QIOChannelCached that adds the read/write buffering above the > base QIOChannel.
We have that in the TODO wiki page for a long time, I'll update it slightly. https://wiki.qemu.org/ToDo/LiveMigration#Rewrite_QEMUFile_for_migration But yeah that might be too big a hammer to solve this specific issue. AFAIU Fabiano is looking into that direction, but I assume it should still be a long term thing. > > That's doable, but bigger than a quick fix. A natural stepping stone > to get there though is to move from 2 QEMUFile objs down to 1 QEMUFile, > which might be more practical as a quick fix. Agree. However would this still be quite some change? We still have a lot of references on the four qemufiles (to/from_dst_file, to/from_src_file), at least that'll need a replacement; I didn't yet further check whether all places can be done with a direct replacement of such change, some tweaks may be needed here and there, but shouldn't be major. Meanwhile IIUC it'll also need a major rework on QEMUFile, allowing it to be bi-directional? We may need to duplicate the cache layer, IIUC, one for each direction IOs. -- Peter Xu