* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> If we detect some error in colo, we will wait for some time,
> hoping users also detect it. If users don't issue failover command.
> We will go into default failover procedure, which the PVM will takeover
> work while SVM is exit in default.

I'm not sure this is needed, especially on the SVM. I don't see any harm
in the SVM waiting forever to be told what to do; it could be told to
failover or quit. I don't see any benefit to it automatically exiting.

On the primary, I can understand it if you didn't have some automated
error detection system (though I think that's rare), but you really would
want to make that failover delay configurable so that you could turn it
off in a system that did have failure detection, because automatically
restarting the primary after it had caused a failover to the secondary
would be very bad.

Dave

>
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> Signed-off-by: Li Zhijian <lizhij...@cn.fujitsu.com>
> ---
>  migration/colo.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>
> diff --git a/migration/colo.c b/migration/colo.c
> index f31e957..1e6d3dd 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -19,6 +19,14 @@
>  #include "qemu/sockets.h"
>  #include "migration/failover.h"
>
> +/*
> + * The delay time before qemu begin the procedure of default failover treatment.
> + * Unit: ms
> + * Fix me: This value should be able to change by command
> + * 'migrate-set-parameters'
> + */
> +#define DEFAULT_FAILOVER_DELAY 2000
> +
>  /* colo buffer */
>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>
> @@ -264,6 +272,7 @@ static void colo_process_checkpoint(MigrationState *s)
>  {
>      QEMUSizedBuffer *buffer = NULL;
>      int64_t current_time, checkpoint_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
> +    int64_t error_time;
>      int ret = 0;
>      uint64_t value;
>
> @@ -322,8 +331,25 @@ static void colo_process_checkpoint(MigrationState *s)
>      }
>
>  out:
> +    current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      if (ret < 0) {
>          error_report("%s: %s", __func__, strerror(-ret));
> +        /* Give users time to get involved in this verdict */
> +        while (current_time - error_time <= DEFAULT_FAILOVER_DELAY) {
> +            if (failover_request_is_active()) {
> +                error_report("Primary VM will take over work");
> +                break;
> +            }
> +            usleep(100 * 1000);
> +            current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
> +        }
> +
> +        qemu_mutex_lock_iothread();
> +        if (!failover_request_is_active()) {
> +            error_report("Primary VM will take over work in default");
> +            failover_request_active(NULL);
> +        }
> +        qemu_mutex_unlock_iothread();
>      }
>
>      qsb_free(buffer);
> @@ -384,6 +410,7 @@ void *colo_process_incoming_thread(void *opaque)
>      QEMUFile *fb = NULL;
>      QEMUSizedBuffer *buffer = NULL; /* Cache incoming device state */
>      uint64_t total_size;
> +    int64_t error_time, current_time;
>      int ret = 0;
>      uint64_t value;
>
> @@ -499,9 +526,28 @@ void *colo_process_incoming_thread(void *opaque)
>      }
>
>  out:
> +    current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      if (ret < 0) {
>          error_report("colo incoming thread will exit, detect error: %s",
>                       strerror(-ret));
> +        /* Give users time to get involved in this verdict */
> +        while (current_time - error_time <= DEFAULT_FAILOVER_DELAY) {
> +            if (failover_request_is_active()) {
> +                error_report("Secondary VM will take over work");
> +                break;
> +            }
> +            usleep(100 * 1000);
> +            current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
> +        }
> +        /* check flag again*/
> +        if (!failover_request_is_active()) {
> +            /*
> +             * We assume that Primary VM is still alive according to
> +             * heartbeat, just kill Secondary VM
> +             */
> +            error_report("SVM is going to exit in default!");
> +            exit(1);
> +        }
>      }
>
>      if (fb) {
> --
> 1.8.3.1
>
>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
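
To illustrate the "configurable, and possible to turn off" suggestion above, here is a minimal sketch of the decision the polling loop could make on each iteration. It is not code from the patch or from QEMU: the names failover_decide and FailoverDecision are hypothetical, and treating a delay of 0 as "never fail over automatically" is an assumed convention, not an existing migrate-set-parameters semantic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type; nothing here exists in QEMU. */
typedef enum {
    KEEP_WAITING,         /* no request yet, and no timeout (or timeout disabled) */
    FAILOVER_BY_REQUEST,  /* the user or management layer asked for failover */
    FAILOVER_BY_DEFAULT   /* the configured delay elapsed with no request */
} FailoverDecision;

/*
 * Decide what to do after an error, given the configured delay, the time
 * elapsed since the error, and whether a failover request is pending.
 *
 * Assumed convention: delay_ms == 0 disables the default failover, so the
 * caller keeps waiting until it is told what to do. The "elapsed > delay"
 * comparison mirrors the patch's "while (elapsed <= delay)" loop.
 */
FailoverDecision failover_decide(int64_t delay_ms, int64_t elapsed_ms,
                                 bool request_active)
{
    if (request_active) {
        return FAILOVER_BY_REQUEST;
    }
    if (delay_ms > 0 && elapsed_ms > delay_ms) {
        return FAILOVER_BY_DEFAULT;
    }
    return KEEP_WAITING;
}
```

In the loops quoted above, such a helper would be called once per 100 ms iteration in place of the fixed DEFAULT_FAILOVER_DELAY comparison; with the delay set to 0, a system that has its own failure detection would never see the primary restart itself.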