Reviving this thread, as a good as place as any for my question:

On Thu, Jun 10, 2021 at 01:07:50PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add an option for a thread to retry connection until succeeds. We'll
> use nbd/client-connection both for reconnect and for initial connection
> in nbd_open(), so we need a possibility to use same NBDClientConnection
> instance to connect once in nbd_open() and then use retry semantics for
> reconnect.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
> ---
>  include/block/nbd.h     |  2 ++
>  nbd/client-connection.c | 56 +++++++++++++++++++++++++++++++----------
>  2 files changed, 45 insertions(+), 13 deletions(-)

>  NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
>                                                 bool do_negotiation,
>                                                 const char *export_name,
> @@ -154,23 +164,43 @@ static void *connect_thread_func(void *opaque)
>      NBDClientConnection *conn = opaque;
>      int ret;
>      bool do_free;
> +    uint64_t timeout = 1;
> +    uint64_t max_timeout = 16;
>  
> -    conn->sioc = qio_channel_socket_new();
> +    while (true) {
> +        conn->sioc = qio_channel_socket_new();
>  
> -    error_free(conn->err);
> -    conn->err = NULL;
> -    conn->updated_info = conn->initial_info;
> +        error_free(conn->err);
> +        conn->err = NULL;
> +        conn->updated_info = conn->initial_info;
>  
> -    ret = nbd_connect(conn->sioc, conn->saddr,
> -                      conn->do_negotiation ? &conn->updated_info : NULL,
> -                      conn->tlscreds, &conn->ioc, &conn->err);

This says that on each retry attempt, we reset whether to ask the
server for structured replies back to our original initial_info
values.

But when dealing with NBD retries in general, I suspect we have a bug.
Consider what happens if our first connection requests structured
replies and base:allocation block status, and we are successful.  But
later, the server disconnects, triggering a retry.  Suppose that on
our retry, we encounter a different server that no longer supports
structured replies.  We would no longer be justified in sending
NBD_CMD_BLOCK_STATUS requests to the reconnected server.  But I can't
find anywhere in the code base that ensures that on a reconnect, the
new server supplies at least as many extensions as the original
server, nor anywhere that we would be able to gracefully handle an
in-flight block status command that can no longer be successfully
continued because the reconnect landed on a downgraded server.

In general, you don't expect a server to downgrade its capabilities
across restarts, so assuming that a retried connection will hit a
server at least as capable as the original server is typical, even if
unsafe.  But it is easy enough to use nbdkit to write a server that
purposefully downgrades its abilities after the first client
connection, for testing how qemu falls apart if it continues making
assumptions about the current server based solely on what it learned
prior to retrying from the first server.

Is this something we need to address quickly for inclusion in 6.2?
Maybe by having a retry connect fail if the new server does not have
the same capabilities as the old?  Do we also need to care about a
server reporting a different size export than the old server?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


Reply via email to