Re: [PATCH 2/6] migration: Kick postcopy threads on cancel

Fabiano Rosas Wed, 04 Dec 2024 12:40:45 -0800

Daniel P. Berrangé <berra...@redhat.com> writes:

> On Wed, Dec 04, 2024 at 02:39:12PM -0500, Peter Xu wrote:
>> On Wed, Dec 04, 2024 at 04:02:36PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <pet...@redhat.com> writes:
>> > 
>> > > On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
>> > >> Make sure postcopy threads are released when migrate_cancel is
>> > >> issued. Kick the postcopy_pause semaphore and have the fault thread
>> > >> read 'fault_thread_quit' when joining.
>> > >> 
>> > >> While here fix the comment mentioning userfault_event_fd.
>> > >> 
>> > >> Signed-off-by: Fabiano Rosas <faro...@suse.de>
>> > >
>> > > I remember when working on postcopy, I thought about failing 
>> > > migrate-cancel
>> > > for postcopy in general, rejecting such request.  And when working on the
>> > > recover feature, there's no concern on having it being cancelled, because
>> > > the user really shouldn't do that..
>> > >
>> > > The problem is migrate-cancel means crashing the VM on both sides when 
>> > > QEMU
>> > > already goes into postcopy stage.
>> > 
>> > Well, that's the sillyness of having a cancel command, you can never
>> > know what "cancel" means. The "documentation" says: "Cancel the current
>> > executing migration process", it doesn't mention anything about the
>> > consequences of such action.
>> 
>> We definitely need cancel.  It was always used in precopy, and people use
>> it a lot!


To be clear, I'm not arguing against cancel. I'm just pointing out that
it's silly because it's just like pressing C-c in the shell in the
middle of something. What's the expected end state? Completely
unspecified. I don't find it at all "elegant" that we treat cancel like
error and just let the code carry on stumbling and exit
eventually. Because then we have this C-c arriving at random moments in
the middle of stuff. The way we do "exiting" in multifd is way more
maintainable. If that flag is set, then let's exit, otherwise everything
should work.

Note that in some parts of this series I'm checking the CANCELLING
state, which aside from being prone to be racy, is a step in a different
direction than the way we've implemented cancel so far (shutdown and let
the code exit vs. explicitly check against the CANCELLING state).

Hope I'm being coherent, I'm not sure.

>
> Not a fair benchmark though.
>
> People use cancel alot because 'precopy' cannot complete in a
> predictable amount of time, any time guesstime can suddenly
> get much worse if the guest dirties memory differently. So
> people give up and cancel it after waiting ridiculously long
> and never being sure if it is nearing the finish.
>
> Once a migrate has been switched to 'postcopy' phase, however,
> we have what should be a highly predictable completion time,
> directly related to the amount of untransferred pages. That
> time should not get worse. The amount of time spent in the
> 'postcopy' phase will depend on how long you let the migrate
> run in 'precopy' before flipping to 'postcopy'
>
> IOW, I think there's a reasonable case to be made that NOT
> having the ability to cancel while in 'postcopy' phase would
> be mostly acceptable. You give up the ability to cancel for
> a while, in exchange for a clearly determined completion
> time. 

Some of these words could be in the documentation for cancel to make it
clear what the expectations are.

>
>> > > If the user wants to crash the VM anyway, an easier way to do is killing 
>> > > on
>> > > both sides.
>> > 
>> > I don't think this is fair. We expose a "cancel" command, we better do
>> > some cancelling or instead reject the command appropriately, not expect
>> > the user to "know better".
>> 
>> That's exactly why we should fail it with a proper error message, IMHO.
>> Because the user may not really understand the impact of postcopy.
>
> Yep, I think users/apps expect "cancel" to be safe. So if it can't be
> safe at certain times, we should reject it in those time windows.

Which kind of implies we should test this by migrate->cancel->migrate,
like the multifd test does, and not like this series does. I've been
focusing more on catching crashes/hangs.

>
>> > > If the user wished to cancel, we should tell them "postcopy cannot be
>> > > cancelled, until complete".  That's probably the major reason why people
>> > > think postcopy is dangerous to use..
>> > 
>> > We could certainly add that restriction, I don't see a problem with
>> > it. That said, what is the actual use case for migrate_cancel? And how
>> > does that compare with yank? I feel like we've been kind of relying on
>> > nobody using those commands really.
>> 
>> We had "cancel" first, then "yank".  I didn't remember who merged yank,
>> especially for migration, and why it was ever needed.
>
> yank is for the case where the network connections are completely stuck,
> causing QEMU to potentially get stalled in I/O operations until a TCP
> timeout is reached. yank force unwedges any stuck I/O by aggresively
> closing the connections. It is most useful in the non-migration use
> cases though.
>

I asked because for socket at least yank and cancel do the same:
shutdown(). It might be more trouble than it's worth it, but I wonder if
we could have everything that can be stuck implement a yank routine and
just have cancel call those. So for instance, when this series does
sem_post on a bunch of semaphores explicitly, the cancel command would
instead call whatever yank routine was registered by the code that can
wait on the semaphore. With this there should be no surprises of a
cancel arriving at a weird moment, because we'd require "code that
sleeps" to implement a yank.

>> Migration users should have stick with "cancel" rather than "yank" - qmp
>> "yank" would "FAIL" the migration instead of showing CANCELLED, definitely
>> should avoid.  I am not aware of anybody that uses "yank" for migration at
>> all.
>> 
>> So yeah, both commands are slightly duplicated, and if we want to throw
>> one, it needs to be yank, not cancel.  I'm fine keeping both..
>
> I would say the difference is like a graceful shutdown vs pulling the
> power plug in a bare metal machine
>
> 'cancel' is intended to be graceful. It should leave you with a functional
> QEMU (or refuse to run if unsafe).
>
> 'yank' is intended to be forceful, letting you get out of bad situations
> that would otherwise require you to kill the entire QEMU process, but
> still with possible associated risk data loss to the QEMU backends.

For migration specifically I don't see much difference. Yank must leave
QEMU in a usable state enough for a second migration to succeed,
otherwise it's useless.

>
> They have overlap, but are none the less different.
>
> With regards,
> Daniel

Re: [PATCH 2/6] migration: Kick postcopy threads on cancel

Reply via email to