For some reason, connections which were closed were being put back on
the work queue, causing a hang in trying to connect to a blocked node,
or a crash trying to access a closed connection.

David provided a fix which introduced the CF_CLOSE flag, but which still
could trigger the crash. Chrissie provided a fix which cleared the
CONNECT_ and WRITE_PENDING flags, but which still could trigger the hang
(I think because send_to_sock() would still attempt to connect in its
retry path). I added a fix which avoided the unconditional call to
send_to_sock() and also cancelled any work which might still be on the
workqueue.

Combined, these three fix the hangs and crashes I have been seeing when
a node was killed (bugzilla.novell.com #52422). 

I'm not perfectly happy with this patch; it feels as if it is fixing
symptoms. In particular, I don't quite understand where
lowcomms_connect_to_sock() ends up being called from with the connection
closed, but I've resisted the urge to insert a BUG() in the if clause
there so far. Maybe someone else is inspired by this patch to reevaluate
the connection handling completely ;-)

Acked-by: teigl...@redhat.com
Acked-by: ccaul...@redhat.com

Index: dlm/lowcomms.c
===================================================================
--- dlm.orig/lowcomms.c
+++ dlm/lowcomms.c
@@ -106,6 +106,7 @@ struct connection {
 #define CF_CONNECT_PENDING 3
 #define CF_INIT_PENDING 4
 #define CF_IS_OTHERCON 5
+#define CF_CLOSE 6
        struct list_head writequeue;  /* List of outgoing writequeue_entries */
        spinlock_t writequeue_lock;
        int (*rx_action) (struct connection *); /* What to do when active */
@@ -299,6 +300,8 @@ static void lowcomms_write_space(struct
 
 static inline void lowcomms_connect_sock(struct connection *con)
 {
+       if (test_bit(CF_CLOSE, &con->flags))
+               return;
        if (!test_and_set_bit(CF_CONNECT_PENDING, &con->flags))
                queue_work(send_workqueue, &con->swork);
 }
@@ -1370,6 +1373,15 @@ int dlm_lowcomms_close(int nodeid)
        log_print("closing connection to node %d", nodeid);
        con = nodeid2con(nodeid, 0);
        if (con) {
+               clear_bit(CF_CONNECT_PENDING, &con->flags);
+               clear_bit(CF_WRITE_PENDING, &con->flags);
+               set_bit(CF_CLOSE, &con->flags);
+               if (cancel_work_sync(&con->swork)) {
+                       log_print("swork cancelled for node %d", nodeid);
+               }
+               if (cancel_work_sync(&con->rwork)) {
+                       log_print("rwork cancelled for node %d", nodeid);
+               }
                clean_one_writequeue(con);
                close_connection(con, true);
        }
@@ -1395,9 +1407,11 @@ static void process_send_sockets(struct
 
        if (test_and_clear_bit(CF_CONNECT_PENDING, &con->flags)) {
                con->connect_action(con);
+               set_bit(CF_WRITE_PENDING, &con->flags);
+       }
+       if (test_and_clear_bit(CF_WRITE_PENDING, &con->flags)) {
+               send_to_sock(con);
        }
-       clear_bit(CF_WRITE_PENDING, &con->flags);
-       send_to_sock(con);
 }
 
 
Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Reply via email to