Reviving this thread, as good a place as any for my question:

On Thu, Jun 10, 2021 at 01:07:50PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add an option for a thread to retry connection until succeeds. We'll
> use nbd/client-connection both for reconnect and for initial connection
> in nbd_open(), so we need a possibility to use same NBDClientConnection
> instance to connect once in nbd_open() and then use retry semantics for
> reconnect.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
> ---
>  include/block/nbd.h     |  2 ++
>  nbd/client-connection.c | 56 +++++++++++++++++++++++++++++++----------
>  2 files changed, 45 insertions(+), 13 deletions(-)
> NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
>                                                bool do_negotiation,
>                                                const char *export_name,
> @@ -154,23 +164,43 @@ static void *connect_thread_func(void *opaque)
>      NBDClientConnection *conn = opaque;
>      int ret;
>      bool do_free;
> +    uint64_t timeout = 1;
> +    uint64_t max_timeout = 16;
>
> -    conn->sioc = qio_channel_socket_new();
> +    while (true) {
> +        conn->sioc = qio_channel_socket_new();
>
> -    error_free(conn->err);
> -    conn->err = NULL;
> -    conn->updated_info = conn->initial_info;
> +        error_free(conn->err);
> +        conn->err = NULL;
> +        conn->updated_info = conn->initial_info;
>
> -    ret = nbd_connect(conn->sioc, conn->saddr,
> -                      conn->do_negotiation ? &conn->updated_info : NULL,
> -                      conn->tlscreds, &conn->ioc, &conn->err);

This says that on each retry attempt, we reset whether to ask the server for
structured replies back to our original initial_info values.  But when dealing
with NBD retries in general, I suspect we have a bug.

Consider what happens if our first connection requests structured replies and
base:allocation block status, and we are successful.  But later, the server
disconnects, triggering a retry.  Suppose that on our retry, we encounter a
different server that no longer supports structured replies.  We would no
longer be justified in sending NBD_CMD_BLOCK_STATUS requests to the
reconnected server.  But I can't find anywhere in the code base that ensures
that on a reconnect, the new server supplies at least as many extensions as
the original server, nor anywhere that we would be able to gracefully handle
an in-flight block status command that can no longer be successfully continued
because the reconnect landed on a downgraded server.

In general, you don't expect a server to downgrade its capabilities across
restarts, so assuming that a retried connection will hit a server at least as
capable as the original server is typical, even if unsafe.
But it is easy enough to use nbdkit to write a server that purposefully
downgrades its abilities after the first client connection, for testing how
qemu falls apart if it keeps making assumptions about the current server based
solely on what it learned from the first server prior to retrying.

Is this something we need to address quickly for inclusion in 6.2?  Maybe by
having a retried connect fail if the new server does not have the same
capabilities as the old?  Do we also need to care about a server reporting a
different export size than the old server?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org