On Wed, Dec 04, 2024 at 04:02:36PM -0300, Fabiano Rosas wrote: > Peter Xu <pet...@redhat.com> writes: > > > On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote: > >> Make sure postcopy threads are released when migrate_cancel is > >> issued. Kick the postcopy_pause semaphore and have the fault thread > >> read 'fault_thread_quit' when joining. > >> > >> While here fix the comment mentioning userfault_event_fd. > >> > >> Signed-off-by: Fabiano Rosas <faro...@suse.de> > > > > I remember when working on postcopy, I thought about failing migrate-cancel > > for postcopy in general, rejecting such request. And when working on the > > recover feature, there's no concern on having it being cancelled, because > > the user really shouldn't do that.. > > > > The problem is migrate-cancel means crashing the VM on both sides when QEMU > > already goes into postcopy stage. > > Well, that's the sillyness of having a cancel command, you can never > know what "cancel" means. The "documentation" says: "Cancel the current > executing migration process", it doesn't mention anything about the > consequences of such action.
We definitely need cancel. It was always used in precopy, and people use it a lot! > > > > > If the user wants to crash the VM anyway, an easier way to do is killing on > > both sides. > > I don't think this is fair. We expose a "cancel" command, we better do > some cancelling or instead reject the command appropriately, not expect > the user to "know better". That's exactly why we should fail it with a proper error message, IMHO. Because the user may not really understand the impact of postcopy. I remember I also tried to reuse migrate-cancel to work as what migrate-pause does currently (so as to not introduce yet another new QMP command). I don't remember why we came up with the new cmd, but with migrate-pause being there, I think the only sane way to respond to cancel request during postcopy is to fail it properly.. > > > > > If the user wished to cancel, we should tell them "postcopy cannot be > > cancelled, until complete". That's probably the major reason why people > > think postcopy is dangerous to use.. > > We could certainly add that restriction, I don't see a problem with > it. That said, what is the actual use case for migrate_cancel? And how > does that compare with yank? I feel like we've been kind of relying on > nobody using those commands really. We had "cancel" first, then "yank". I didn't remember who merged yank, especially for migration, and why it was ever needed. What I still remember is yank crash qemu migration quite frequently, though I can't blame it as it does help us to find quite a few mismatch on iochannel setup/cleanup or stuff like that. So it's sometimes "helpful" as kind of a program checker.. Migration users should have stick with "cancel" rather than "yank" - qmp "yank" would "FAIL" the migration instead of showing CANCELLED, definitely should avoid. I am not aware of anybody that uses "yank" for migration at all. So yeah, both commands are slightly duplicated, and if we want to throw one, it needs to be yank, not cancel. I'm fine keeping both.. > > One truth that we have (because it's tested) is that the multifd > migration allows migrate_cancel on the source and another migration to > start from it. > > (btw, that reminds me that multifd+postcopy will probably break that > test). When postcopy==true (even if multifd==true as well! Or not, it doesn't matter), we should reject the cancel request, IMHO. -- Peter Xu