* Peter Xu (pet...@redhat.com) wrote: > On Thu, Sep 21, 2017 at 08:21:45PM +0100, Dr. David Alan Gilbert wrote: > > * Peter Xu (pet...@redhat.com) wrote: > > > Now when network down for postcopy, the source side will not fail the > > > migration. Instead we convert the status into this new paused state, and > > > we will try to wait for a rescue in the future. > > > > > > If a recovery is detected, migration_thread() will reset its local > > > variables to prepare for that. > > > > > > Signed-off-by: Peter Xu <pet...@redhat.com> > > > --- > > > migration/migration.c | 98 > > > +++++++++++++++++++++++++++++++++++++++++++++++--- > > > migration/migration.h | 3 ++ > > > migration/trace-events | 1 + > > > 3 files changed, 98 insertions(+), 4 deletions(-) > > > > > > diff --git a/migration/migration.c b/migration/migration.c > > > index f6130db..8d26ea8 100644 > > > --- a/migration/migration.c > > > +++ b/migration/migration.c > > > @@ -993,6 +993,8 @@ static void migrate_fd_cleanup(void *opaque) > > > > > > notifier_list_notify(&migration_state_notifiers, s); > > > block_cleanup_parameters(s); > > > + > > > + qemu_sem_destroy(&s->postcopy_pause_sem); > > > } > > > > > > void migrate_fd_error(MigrationState *s, const Error *error) > > > @@ -1136,6 +1138,7 @@ MigrationState *migrate_init(void) > > > s->migration_thread_running = false; > > > error_free(s->error); > > > s->error = NULL; > > > + qemu_sem_init(&s->postcopy_pause_sem, 0); > > > > > > migrate_set_state(&s->state, MIGRATION_STATUS_NONE, > > > MIGRATION_STATUS_SETUP); > > > > > > @@ -1938,6 +1941,80 @@ bool migrate_colo_enabled(void) > > > return s->enabled_capabilities[MIGRATION_CAPABILITY_X_COLO]; > > > } > > > > > > +typedef enum MigThrError { > > > + /* No error detected */ > > > + MIG_THR_ERR_NONE = 0, > > > + /* Detected error, but resumed successfully */ > > > + MIG_THR_ERR_RECOVERED = 1, > > > + /* Detected fatal error, need to exit */ > > > + MIG_THR_ERR_FATAL = 2, > > > > I don't think it's necessary to assign the values there, but it's OK. > > > > > +} MigThrError; > > > + > > > +/* > > > + * We don't return until we are in a safe state to continue current > > > + * postcopy migration. Returns MIG_THR_ERR_RECOVERED if recovered, or > > > + * MIG_THR_ERR_FATAL if unrecovery failure happened. > > > + */ > > > +static MigThrError postcopy_pause(MigrationState *s) > > > +{ > > > + assert(s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); > > > + migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_ACTIVE, > > > + MIGRATION_STATUS_POSTCOPY_PAUSED); > > > + > > > + /* Current channel is possibly broken. Release it. */ > > > + assert(s->to_dst_file); > > > + qemu_file_shutdown(s->to_dst_file); > > > + qemu_fclose(s->to_dst_file); > > > + s->to_dst_file = NULL; > > > + > > > + error_report("Detected IO failure for postcopy. " > > > + "Migration paused."); > > > + > > > + /* > > > + * We wait until things fixed up. Then someone will setup the > > > + * status back for us. > > > + */ > > > + while (s->state == MIGRATION_STATUS_POSTCOPY_PAUSED) { > > > + qemu_sem_wait(&s->postcopy_pause_sem); > > > + } > > > + > > > + trace_postcopy_pause_continued(); > > > + > > > + return MIG_THR_ERR_RECOVERED; > > > +} > > > + > > > +static MigThrError migration_detect_error(MigrationState *s) > > > +{ > > > + int ret; > > > + > > > + /* Try to detect any file errors */ > > > + ret = qemu_file_get_error(s->to_dst_file); > > > + > > > + if (!ret) { > > > + /* Everything is fine */ > > > + return MIG_THR_ERR_NONE; > > > + } > > > + > > > + if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret == -EIO) { > > > > We do need to make sure that whenever we hit a failure in migration > > due to a device that we pass that up rather than calling > > qemu_file_set_error - e.g. an EIO in a block device or network. > > > > However, > > > > Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com> > > I'll take the R-b first. :) > > Regarding to above - aren't we currently detecting these kind of > errors using -EIO? And network down should be only one of such case? > > For now I still cannot distinguish network down out of something worse > that cannot even be recovered. No matter what, current code will go > into PAUSED state as long as EIO is got. I thought about it, and for > now I don't think it is a problem, since even if it is a critical > failure and unable to recover in any way, we still won't lose anything > if we stop the VM at once (that's what paused state do - VM is just > stopped). For the critical failures, we will just find out that the > recovery will fail again rather than success.
Yes I think it's fine for now; my suspicion is that sometimes errors from devices (e.g. disk/NIC) end up in the qemu_file_set_error - but they shouldn't, I think we should try and keep that just for actual migration stream transport errors, and then this patch is safe. Dave > -- > Peter Xu -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK