I'm currently trying to debug a hang in 'nbdkit nbd' as a client when TLS is in use, when dealing with three processes: nbdkit example1[1] <=tls=> nbdkit nbd[2] <=plain=> qemu-nbd --list[3]
In theory, when process [3] exits, the server side of process [2] is supposed to disconnect as client from process [1]. But we were hitting sporadic hangs where process [2] was stuck in a poll() while in gnutls cleanup code, which appears to be because it did not recognize a clean shutdown from process [3]. I do not yet have a strong reproducible test case (which might include adding strategic sleep()s in one-off compilations) to prove that the problem is an unclean TLS shutdown, but at the same time, the gnutls documentation is clear that without proper use of gnutls_bye(), a client cannot distinguish between clean end of communications and a malicious interruption in service. Shutting down the write side of a bi-directional socket as soon as possible makes it easier for a client to begin its own shutdown actions, and can help in avoiding deadlocks where both sides of a connection are waiting for the other side to close() the socket first. If I can come up with a reliable test case, this series may be a candidate for 5.0; but for now, I'm only proposing it for 5.1 (we've had more than one release where qemu was not exploiting the full power of gnutls_bye(), and even if that triggers poor interaction with other endpoints such as nbdkit, going one more release with the same behavior isn't making things worse). Eric Blake (3): crypto: Add qcrypto_tls_shutdown() io: Support shutdown of TLS channel nbd: Use shutdown(SHUT_WR) after last item sent qapi/crypto.json | 15 +++++++++++++++ include/crypto/tlssession.h | 24 ++++++++++++++++++++++++ block/nbd.c | 1 + crypto/tlssession.c | 27 +++++++++++++++++++++++++++ io/channel-tls.c | 27 ++++++++++++++++++++++++++- nbd/client.c | 3 ++- nbd/server.c | 4 ++++ 7 files changed, 99 insertions(+), 2 deletions(-) -- 2.26.0.rc2