For some reason, closed connections were being put back on the work queue, causing either a hang while trying to connect to a blocked node or a crash while accessing a closed connection.
David provided a fix which introduced the CF_CLOSE flag, but which could still trigger the crash. Chrissie provided a fix which cleared the CONNECT_ and WRITE_PENDING flags, but which could still trigger the hang (I think because send_to_sock() would still attempt to connect in its retry path). I added a fix which avoided the unconditional call to send_to_sock() and also cancelled any work which might still be on the workqueue.

Combined, these three fix the hangs and crashes I have been seeing when a node was killed (bugzilla.novell.com #52422).

I'm not perfectly happy with this patch; it feels as if it is fixing symptoms. In particular, I don't quite understand where lowcomms_connect_sock() ends up being called from with the connection closed, but I've resisted the urge to insert a BUG() in the if clause there so far.

Maybe this patch will inspire someone else to reevaluate the connection handling completely ;-)

Acked-by: teigl...@redhat.com
Acked-by: ccaul...@redhat.com

Index: dlm/lowcomms.c
===================================================================
--- dlm.orig/lowcomms.c
+++ dlm/lowcomms.c
@@ -106,6 +106,7 @@ struct connection {
 #define CF_CONNECT_PENDING 3
 #define CF_INIT_PENDING 4
 #define CF_IS_OTHERCON 5
+#define CF_CLOSE 6
 	struct list_head writequeue; /* List of outgoing writequeue_entries */
 	spinlock_t writequeue_lock;
 	int (*rx_action) (struct connection *); /* What to do when active */
@@ -299,6 +300,8 @@ static void lowcomms_write_space(struct
 static inline void lowcomms_connect_sock(struct connection *con)
 {
+	if (test_bit(CF_CLOSE, &con->flags))
+		return;
 	if (!test_and_set_bit(CF_CONNECT_PENDING, &con->flags))
 		queue_work(send_workqueue, &con->swork);
 }
@@ -1370,6 +1373,15 @@ int dlm_lowcomms_close(int nodeid)
 	log_print("closing connection to node %d", nodeid);
 	con = nodeid2con(nodeid, 0);
 	if (con) {
+		clear_bit(CF_CONNECT_PENDING, &con->flags);
+		clear_bit(CF_WRITE_PENDING, &con->flags);
+		set_bit(CF_CLOSE, &con->flags);
+		if (cancel_work_sync(&con->swork)) {
+			log_print("swork cancelled for node %d", nodeid);
+		}
+		if (cancel_work_sync(&con->rwork)) {
+			log_print("rwork cancelled for node %d", nodeid);
+		}
 		clean_one_writequeue(con);
 		close_connection(con, true);
 	}
@@ -1395,9 +1407,11 @@ static void process_send_sockets(struct
 	if (test_and_clear_bit(CF_CONNECT_PENDING, &con->flags)) {
 		con->connect_action(con);
+		set_bit(CF_WRITE_PENDING, &con->flags);
+	}
+	if (test_and_clear_bit(CF_WRITE_PENDING, &con->flags)) {
+		send_to_sock(con);
 	}
-	clear_bit(CF_WRITE_PENDING, &con->flags);
-	send_to_sock(con);
 }

Regards,
    Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde