On Wed, Feb 10, 2016 at 7:55 PM, Bob Peterson <rpete...@redhat.com> wrote:
> I've been doing a bunch of recovery testing with DLM and discovered some
> issues. This collection of 6 patches addresses those issues. Some of them
> are of my own making, introduced by the recent patches that made DLM
> print socket connection errors, and recovery from those errors.
>
> The first patch changes the TCP "connect to sock" function to more closely
> match the SCTP version of the function. The idea is to not create a kernel
> socket until we have a valid node address, like it does in the SCTP path.
>
> The second patch removes a "return" from lowcomms_error_report that should
> not be there. The return was causing it to bypass calling the original
> error report code, thus skipping an important part in the reporting.
>
> The third patch changes function tcp_create_listen_sock so that its
> error path is consistent. Only one of its error paths was setting
> con->sock to NULL, but it should be done in both cases.
>
> The fourth patch eliminates a useless goto, to make the code more clear.
>
> The fifth patch adds a layer of locking by way of the sk->sk_callback_lock
> which is needed to prevent multiple send/receive sockets from
> interfering with one another when reporting the socket errors and
> subsequent recovery. This makes it similar to how sunrpc handles errors.
>
> The sixth and final patch makes the socket error code save and restore
> all four callbacks, whereas before we were only saving and restoring the
> error report callback.

This patch set makes removing lockspaces a lot more robust for me. One test
case that regularly triggers NULL pointer dereferences in callbacks from TCP
to DLM without these patches is removing a lockspace on three cluster nodes
"simultaneously".

Thanks,
Andreas