On Tue, Sep 23, 2025 at 04:58:15PM +0200, Juraj Marcin wrote:
> On 2025-09-22 11:51, Peter Xu wrote:
> > On Mon, Sep 22, 2025 at 02:58:38PM +0200, Juraj Marcin wrote:
> > > Hi Fabiano,
> > > 
> > > On 2025-09-19 13:46, Fabiano Rosas wrote:
> > > > Juraj Marcin <[email protected]> writes:
> > > > 
> > > > Hi Juraj,
> > > > 
> > > > Good patch, nice use of migrate_has_failed()
> > > 
> > > Thanks!
> > > 
> > > > 
> > > > > From: Juraj Marcin <[email protected]>
> > > > >
> > > > > Currently, there are two functions that are responsible for cleanup of
> > > > > the incoming migration state. With successful precopy, it's the main
> > > > > thread and with successful postcopy it's the listen thread. However, 
> > > > > if
> > > > > postcopy fails during in the device load, both functions will try to 
> > > > > do
> > > > > the cleanup. Moreover, when exit-on-error parameter was added, it was
> > > > > applied only to precopy.
> > > > >
> > > > 
> > > > Someone could be relying in postcopy always exiting on error while
> > > > explicitly setting exit-on-error=false for precopy and this patch would
> > > > change the behavior incompatibly. Is this an issue? I'm willing to
> > > > ignore it, but you guys know more about postcopy.
> > > 
> > > Good question. When going through older patches where postcopy listen
> > > thread and then where exit-on-error were implemented, it seemed more
> > > like an overlook than intentional omission. However, it might be better
> > > to not break any potential users of this, we could add another option,
> > > "exit-on-postcopy-error" that would allow such handling if postscopy
> > > failed unrecoverably. I've already talked about such option with
> > > @jdenemar and he agreed with it.
> > 
> > The idea for postcopy ram is, it should never fail.. as failing should
> > never be better than a pause.  Block dirty bitmap might be different,
> > though, when enabled separately.
> > 
> > For postcopy-ram, qemu_loadvm_state_main() will in reality only receive RAM
> > updates. It'll almost always trigger the postcopy_pause_incoming() path
> > when anything fails.
> > 
> > For pure block-dirty-bitmap-only styled postcopy: for this exit-on-error, I
> > also don't think we should really "exit on errors", even if the flag is
> > set.  IIUC, it's not fatal to the VM if that failed, as described in:
> 
> I agree, however, this patch doesn't add any new cases in which the
> destination QEMU would exit. If there is an error in block dirty bitmaps
> it is only reported to the console, and then it continues to waiting for
> main_thread_load_event, marks the migration as COMPLETED and does the
> cleanup, same as before. [1] I can add a comment similar to "prevent
> further exit" as there was before.
> 
> However, if there is other error, in which the postcopy cannot pause
> (for example there was a failure in the main thread loading the device
> state before the machine started), the migration status changes to
> FAILED and jumps right to cleanup which then checks exit-on-error and
> optionally exits the QEMU, before it would always exit in such case [2]:
> 
> [1]: 
> https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2120
> [2]: 
> https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2150

Ah, so you're saying an failure within qemu_loadvm_state_main() but during
POSTCOPY_INCOMING_LISTENING..  I think you're right.  It's possible.

In that case, exit-on-postcopy-error still sounds an overkill to me.  I
vote that we go ahead reusing exit-on-error for postcopy too. IIUC that's
what this current patch does as-is.

-- 
Peter Xu


Reply via email to