On Wed, Jul 12, 2017 at 01:36:07PM +0100, Dr. David Alan Gilbert wrote: > * Peter Xu (pet...@redhat.com) wrote: > > On Tue, Jul 04, 2017 at 07:49:13PM +0100, Dr. David Alan Gilbert (git) > > wrote: > > > From: "Dr. David Alan Gilbert" <dgilb...@redhat.com> > > > > > > When waiting for a WRID, if the other side dies we end up waiting > > > for ever with no way to cancel the migration. > > > Cure this by poll()ing the fd first with a timeout and checking > > > error flags and migration state. > > > > > > Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com> > > > --- > > > migration/rdma.c | 54 > > > ++++++++++++++++++++++++++++++++++++++++++++++++------ > > > 1 file changed, 48 insertions(+), 6 deletions(-) > > > > > > diff --git a/migration/rdma.c b/migration/rdma.c > > > index 6111e10c70..7273ae9929 100644 > > > --- a/migration/rdma.c > > > +++ b/migration/rdma.c > > > @@ -1466,6 +1466,52 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, > > > uint64_t *wr_id_out, > > > return 0; > > > } > > > > > > +/* Wait for activity on the completion channel. > > > + * Returns 0 on success, none-0 on error. > > > + */ > > > +static int qemu_rdma_wait_comp_channel(RDMAContext *rdma) > > > +{ > > > + /* > > > + * Coroutine doesn't start until migration_fd_process_incoming() > > > + * so don't yield unless we know we're running inside of a coroutine. > > > + */ > > > + if (rdma->migration_started_on_destination) { > > > + yield_until_fd_readable(rdma->comp_channel->fd); > > > + } else { > > > + /* This is the source side, we're in a separate thread > > > + * or destination prior to migration_fd_process_incoming() > > > + * we can't yield; so we have to poll the fd. > > > + * But we need to be able to handle 'cancel' or an error > > > + * without hanging forever. > > > + */ > > > + while (!rdma->error_state && !rdma->error_reported && > > > + !rdma->received_error) { > > > > Should we just check rdma->error_state? Since iiuc error_reported only > > means whether error is reported (set to 1 if reported), and > > received_error means whether we got explicit error from peer (I think > > only via a RDMA_CONTROL_ERROR message), and it's 1 if received. IIUC > > either error_reported or received_error will mean that error_state be > > set already. > > I think I agree we don't need to check both error_state and > error_reported; I think it's best to check received_error since if > the other end fails we want this end to stop waiting for anything else. > I've not convinced myself that every path from receiving that error > till this point would have set the error_state (although it probably > has); so I suggest: > while (!rdma->error_state && !rdma->received_error) {
qemu_rdma_exchange_get_response() is the only place I see to set the received_error. And when that is triggered, the stack would be: qemu_rdma_exchange_get_response() set received_error=true, ret -EIO <-- qemu_rdma_exchange_recv() return the -EIO <-- qio_channel_rdma_readv() pass the -EIO to error_state, or <-- qemu_rdma_registration_handle() pass the -EIO to error_state But even so, I'm totally fine that we check both. > > > > + GPollFD pfds[1]; > > > + pfds[0].fd = rdma->comp_channel->fd; > > > + pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR; > > > + /* 0.5s timeout, should be fine for a 'cancel' */ > > > + switch (qemu_poll_ns(pfds, 1, 500 * 1000 * 1000)) { > > > > (I would be more aggresive, say, 100ms, or even less, for better UI > > response with merely nothing lost. But 500ms is okay as well to me) > > Yes, it felt fairly arbitrary here. > > > > + case 1: /* fd active */ > > > + return 0; > > > + > > > + case 0: /* Timeout, go around again */ > > > + break; > > > + > > > + default: /* Error of some type */ > > > + return -1; > > > + } > > > + > > > + if (migrate_get_current()->state == > > > MIGRATION_STATUS_CANCELLING) { > > > + /* Bail out and let the cancellation happen */ > > > + return -EPIPE; > > > + } > > > + } > > > + } > > > + > > > + return rdma->error_state || rdma->error_reported || > > > + rdma->received_error; > > > > Similar question here. Maybe just return rdma->error_state? > > As above, I could just drop the error_reported part. Sure. Will reply on the new version. -- Peter Xu