On Tue, Sep 23, 2025 at 04:58:15PM +0200, Juraj Marcin wrote: > On 2025-09-22 11:51, Peter Xu wrote: > > On Mon, Sep 22, 2025 at 02:58:38PM +0200, Juraj Marcin wrote: > > > Hi Fabiano, > > > > > > On 2025-09-19 13:46, Fabiano Rosas wrote: > > > > Juraj Marcin <[email protected]> writes: > > > > > > > > Hi Juraj, > > > > > > > > Good patch, nice use of migrate_has_failed() > > > > > > Thanks! > > > > > > > > > > > > From: Juraj Marcin <[email protected]> > > > > > > > > > > Currently, there are two functions that are responsible for cleanup of > > > > > the incoming migration state. With successful precopy, it's the main > > > > > thread and with successful postcopy it's the listen thread. However, > > > > > if > > > > > postcopy fails during in the device load, both functions will try to > > > > > do > > > > > the cleanup. Moreover, when exit-on-error parameter was added, it was > > > > > applied only to precopy. > > > > > > > > > > > > > Someone could be relying in postcopy always exiting on error while > > > > explicitly setting exit-on-error=false for precopy and this patch would > > > > change the behavior incompatibly. Is this an issue? I'm willing to > > > > ignore it, but you guys know more about postcopy. > > > > > > Good question. When going through older patches where postcopy listen > > > thread and then where exit-on-error were implemented, it seemed more > > > like an overlook than intentional omission. However, it might be better > > > to not break any potential users of this, we could add another option, > > > "exit-on-postcopy-error" that would allow such handling if postscopy > > > failed unrecoverably. I've already talked about such option with > > > @jdenemar and he agreed with it. > > > > The idea for postcopy ram is, it should never fail.. as failing should > > never be better than a pause. Block dirty bitmap might be different, > > though, when enabled separately. > > > > For postcopy-ram, qemu_loadvm_state_main() will in reality only receive RAM > > updates. It'll almost always trigger the postcopy_pause_incoming() path > > when anything fails. > > > > For pure block-dirty-bitmap-only styled postcopy: for this exit-on-error, I > > also don't think we should really "exit on errors", even if the flag is > > set. IIUC, it's not fatal to the VM if that failed, as described in: > > I agree, however, this patch doesn't add any new cases in which the > destination QEMU would exit. If there is an error in block dirty bitmaps > it is only reported to the console, and then it continues to waiting for > main_thread_load_event, marks the migration as COMPLETED and does the > cleanup, same as before. [1] I can add a comment similar to "prevent > further exit" as there was before. > > However, if there is other error, in which the postcopy cannot pause > (for example there was a failure in the main thread loading the device > state before the machine started), the migration status changes to > FAILED and jumps right to cleanup which then checks exit-on-error and > optionally exits the QEMU, before it would always exit in such case [2]: > > [1]: > https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2120 > [2]: > https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2150
Ah, so you're saying an failure within qemu_loadvm_state_main() but during POSTCOPY_INCOMING_LISTENING.. I think you're right. It's possible. In that case, exit-on-postcopy-error still sounds an overkill to me. I vote that we go ahead reusing exit-on-error for postcopy too. IIUC that's what this current patch does as-is. -- Peter Xu
