On Wed, Feb 10, 2016 at 7:55 PM, Bob Peterson <rpete...@redhat.com> wrote:
> I've been doing a bunch of recovery testing with DLM and discovered some
> issues. This collection of 6 patches addresses those issues. Some of them
> are of my own making, introduced by the recent patches that made DLM
> print socket connection errors and recover from those errors.
>
> The first patch changes the TCP "connect to sock" function to more closely
> match the SCTP version of the function. The idea is to not create a kernel
> socket until we have a valid node address, like it does in the SCTP path.
>
> The second patch removes a "return" from lowcomms_error_report that should
> not be there. The return was causing it to bypass calling the original
> error report code, thus skipping an important part in the reporting.
>
> The third patch changes function tcp_create_listen_sock so that its
> error path is consistent. Only one of its error paths was setting
> con->sock to NULL, but it should be done in both cases.
>
> The fourth patch eliminates a useless goto, to make the code clearer.
>
> The fifth patch adds a layer of locking by way of sk->sk_callback_lock,
> which is needed to prevent multiple send/receive sockets from
> interfering with one another when reporting socket errors and during
> subsequent recovery. This makes it similar to how sunrpc handles errors.
>
> The sixth and final patch makes the socket error code save and restore
> all four callbacks, whereas before we were only saving and restoring the
> error report callback.

This patch set makes removing lockspaces a lot more robust for me.
Without these patches, removing a lockspace on three cluster nodes
"simultaneously" regularly triggers NULL pointer dereferences in
callbacks from TCP to DLM.
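
For readers following along, here is a minimal userspace sketch of the
idea behind the fifth and sixth patches: save all four socket callbacks
under the callback lock, restore them under the same lock, and have the
replacement callback check its owner pointer before using it. This is
not the DLM code. struct mock_sock, struct saved_callbacks, the helper
names and the atomic_flag spinlock (standing in for
sk->sk_callback_lock) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mock_sock;
typedef void (*sock_cb)(struct mock_sock *);

/* Userspace stand-in for the kernel's struct sock (illustrative only). */
struct mock_sock {
    atomic_flag sk_callback_lock;   /* plays the role of sk->sk_callback_lock */
    void *sk_user_data;             /* owner pointer, NULL once torn down */
    sock_cb sk_data_ready;
    sock_cb sk_state_change;
    sock_cb sk_write_space;
    sock_cb sk_error_report;
};

/* Hypothetical holder for the four original callbacks. */
struct saved_callbacks {
    sock_cb data_ready, state_change, write_space, error_report;
};

static void cb_lock(struct mock_sock *sk)
{
    while (atomic_flag_test_and_set(&sk->sk_callback_lock))
        ;                           /* spin until the lock is free */
}

static void cb_unlock(struct mock_sock *sk)
{
    atomic_flag_clear(&sk->sk_callback_lock);
}

/* Save all four callbacks, then install replacements, under the lock. */
static void save_and_install(struct mock_sock *sk, struct saved_callbacks *saved,
                             void *owner, sock_cb ready, sock_cb state,
                             sock_cb space, sock_cb err)
{
    cb_lock(sk);
    saved->data_ready = sk->sk_data_ready;
    saved->state_change = sk->sk_state_change;
    saved->write_space = sk->sk_write_space;
    saved->error_report = sk->sk_error_report;
    sk->sk_user_data = owner;
    sk->sk_data_ready = ready;
    sk->sk_state_change = state;
    sk->sk_write_space = space;
    sk->sk_error_report = err;
    cb_unlock(sk);
}

/* Clear the owner and put all four original callbacks back, under the lock. */
static void restore_callbacks(struct mock_sock *sk,
                              const struct saved_callbacks *saved)
{
    cb_lock(sk);
    sk->sk_user_data = NULL;
    sk->sk_data_ready = saved->data_ready;
    sk->sk_state_change = saved->state_change;
    sk->sk_write_space = saved->write_space;
    sk->sk_error_report = saved->error_report;
    cb_unlock(sk);
}

static void default_cb(struct mock_sock *sk) { (void)sk; }
static void noop_cb(struct mock_sock *sk) { (void)sk; }

static int errors_seen;

/* Replacement error callback: only touches the owner if it is still set. */
static void demo_error_report(struct mock_sock *sk)
{
    cb_lock(sk);
    if (sk->sk_user_data != NULL)
        errors_seen++;
    cb_unlock(sk);
}

/* Returns 0 if the callbacks were swapped in and fully restored. */
static int demo(void)
{
    struct mock_sock sk = {
        .sk_callback_lock = ATOMIC_FLAG_INIT,
        .sk_data_ready = default_cb, .sk_state_change = default_cb,
        .sk_write_space = default_cb, .sk_error_report = default_cb,
    };
    struct saved_callbacks saved;
    int owner = 1;

    save_and_install(&sk, &saved, &owner, noop_cb, noop_cb, noop_cb,
                     demo_error_report);
    sk.sk_error_report(&sk);        /* counted: owner is set */
    restore_callbacks(&sk, &saved);
    sk.sk_error_report(&sk);        /* default callback again, not counted */

    return (errors_seen == 1 &&
            sk.sk_data_ready == default_cb &&
            sk.sk_state_change == default_cb &&
            sk.sk_write_space == default_cb &&
            sk.sk_error_report == default_cb &&
            sk.sk_user_data == NULL) ? 0 : 1;
}
```

In the kernel the lock would be an rwlock taken with the _bh variants
rather than a spinning flag, so treat the locking here as a placeholder
for the ordering only.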

Thanks,
Andreas
