[Pvfs2-developers] Re: halloween bug (possible patch update)

Sam Lang Wed, 17 Oct 2007 09:30:11 -0700


Hi Phil,

Thanks for the patch and the quick turnaround. I've made somecomments inline. You're much more familiar with the design of thecode (having written it :-), so maybe I'm missing the big picture,but I wonder if this is more workaround then we need right now.


On Oct 17, 2007, at 10:33 AM, Phil Carns wrote:

Actually, one more follow up on this specific patch to wrap up.The problem that I suspected actually does not occur, although itmay have been by luck :) There is a tcp_addr_data->bmi_addr fieldthat gets used as an argument to the forget_callback() function.That bmi_addr field is set to zero unless it is filled in by theaddr_reg_callback() function. That means that on the client side,the forget_callback() function actually just fails because itsearches for a BMI_addr_t of value 0 in the reference list.I just tested it out, and pvfs2-client-core was able to recoverfrom a network error just fine with the patch in place.
I still agree that it would probably be better in the long run toeliminate the bmi ref list, but I kept looking at the currentsituation to get a fix in the meantime. The attached patch buildson what is currently in trunk with a few changes:
- removes additional ref list lock that was added toaddr_list_callback() with one of the recent commits. I think thiscleanup was not directly related to the halloween bug. It appearsreasonable, but it introduced a lock ordering problem between theref list mutex and bmi_tcp's interface mutex.- The new forget_callback() function had three possible issues: thesame lock ordering issue as in the above bullet (although therecent commit to release the interface mutex may have also fixed it),

the possibility of deleting an address with a non-zero referencecount,

There's no way to send the response to the client once thatconnection is lost, so why keep the address reference around? Theforget_callback was only getting called if the connection hadclosed. If a request was in progress (the only way the referencecount could be nonzero), then an error is going to occur either way-- when the decrement is attempted, or the response send is attempted.

and also I think in bmi_tcp it didn't take care of freeingbmi_tcp's internal method address.

forget_callback is getting called before the method address iscleaned up and deallocated locally. Do you mean the forget callbackdoesn't call back into the method to do the cleanup?

I changed the forget_callback implementation to just mark thetarget addresses in a queue for the bmi control layer to look atlater rather than trying to immediately free the address. I thinkthis fixes all three problems by having the control layer be thecomponent that drives the address release from the top down (usingthe same mechanism already in place for when the reference countreaches zero).

I have to be honest, adding another queue to manage addresses seemslike heading in the wrong direction. It pushes the reference removeto some point in the future, when a separate request is going to gethit with needing to remove other references (the reference decrementblocks the state action). Depending on the number of operations inplay, I could imagine random operations getting hit with a noticeablelatency having to remove a bunch of other references from the list.I hear what your saying about a top-down approach, but that's notconsistent with the register_callback triggering (from the bmi-tcplayer) the addition of a reference on a new connection event.

I guess in general I would argue that the methods are the eventproducers, and they choose to either queue events for the job layerto manage or handle them internally. The middle bmi wrapper layershouldn't also be managing things there. I know you're not arguingagainst that, but as a temporary fix, I'd rather see events gethandled directly rather than queued by the bmi wrappers for later.

- moved the code in the set_info() function for dropping a bmiaddress into a subroutine so that the same code path is used forboth the forget_callback() and the set_info(BMI_DROP_ADDR) case.
Basically I think we already had a code path that did the rightthing (in terms of bmi_tcp) in place to get triggered when a bmiaddr reference count hit zero. The problem in this slowdownsituation, though, was that the socket was being closed _after_ theref count hit zero.

There was never any activity on the addr after that to triggerthe cleanup. The forget_callback() essentially is now givingmethods a way to tell the control layer to consider dropping theaddress again independent of what the reference count has done.


But what's to consider if the connection is closed?

I'm still testing on this some, but I wanted to go ahead and throwit out there for discussion.

Again, thanks for the patch. If it turns out to fix problems andpass tests, I'm cool with adding it -- I guess I just like to argue. :-)


-sam


-Phil
diff -Naur pvfs2-stock/src/io/bmi/bmi.c pvfs2/src/io/bmi/bmi.c
--- pvfs2-stock/src/io/bmi/bmi.c        2007-10-09 17:58:29.000000000 -0400
+++ pvfs2/src/io/bmi/bmi.c      2007-10-17 10:44:21.000000000 -0400
@@ -38,6 +38,15 @@
 static gen_mutex_t context_mutex = GEN_MUTEX_INITIALIZER;
 static gen_mutex_t ref_mutex = GEN_MUTEX_INITIALIZER;

+static QLIST_HEAD(forget_list);
+static gen_mutex_t forget_list_mutex = GEN_MUTEX_INITIALIZER;
+
+struct forget_item
+{
+    struct qlist_head link;
+    PVFS_BMI_addr_t addr;
+};
+
 /*
  * Static list of defined BMI methods.  These are pre-compiled into
  * the client libraries and into the server.
@@ -106,6 +115,8 @@

 static int activate_method(const char *name, const char *listen_addr,
     int flags);
+static void bmi_addr_drop(ref_st_p tmp_ref);
+static void bmi_check_forget_list(void);

 /** Initializes the BMI layer.  Must be called before any other BMI
  *  functions.
@@ -910,6 +921,9 @@
     ref_st_p tmp_ref = NULL;
     int tmp_active_method_count = 0;

+    /* figure out if we need to drop any stale addresses */
+    bmi_check_forget_list();
+
     gen_mutex_lock(&active_method_count_mutex);
     tmp_active_method_count = active_method_count;
     gen_mutex_unlock(&active_method_count_mutex);
@@ -1277,27 +1291,7 @@

        if(tmp_ref->ref_count == 0)
        {
-           struct method_drop_addr_query query;
-           query.response = 0;
-           query.addr = tmp_ref->method_addr;
-           /* reference count is zero; ask module if it wants us to discard
-            * the address; TCP will tell us to drop addresses for which the
-            * socket has died with no possibility of reconnect
-            */
-           ret = tmp_ref->interface->BMI_meth_get_info(BMI_DROP_ADDR_QUERY,
-               &query);
-           if(ret == 0 && query.response == 1)
-           {
-               /* kill the address */
-               gossip_debug(GOSSIP_BMI_DEBUG_CONTROL,
-                   "[BMI CONTROL]: %s: bmi discarding address: %llu\n",
-                    __func__, llu(addr));
-               ref_list_rem(cur_ref_list, addr);
-               /* NOTE: this triggers request to module to free underlying
-                * resources if it wants to
-                */
-               dealloc_ref_st(tmp_ref);
-           }
+            bmi_addr_drop(tmp_ref);
        }
        gen_mutex_unlock(&ref_mutex);
        return(0);
@@ -1907,33 +1901,30 @@
     new_ref->interface = active_method_table[map->method_type];

     /* add the reference structure to the list */
-    gen_mutex_lock(&ref_mutex);
     ref_list_add(cur_ref_list, new_ref);
-    gen_mutex_unlock(&ref_mutex);

     return new_ref->bmi_addr;
 }

 int bmi_method_addr_forget_callback(PVFS_BMI_addr_t addr)
 {
-    ref_st_p ref;
+    struct forget_item* tmp_item = NULL;

-    gen_mutex_lock(&ref_mutex);
-    ref = ref_list_search_addr(cur_ref_list, addr);
-    if (!ref)

+ tmp_item = (struct forget_item*)malloc(sizeof(structforget_item));

+    if(!tmp_item)
     {
-       gen_mutex_unlock(&ref_mutex);
-       return (bmi_errno_to_pvfs(-EPROTO));
+        return(bmi_errno_to_pvfs(-ENOMEM));
     }
-    gen_mutex_unlock(&ref_mutex);

-    ref_list_rem(cur_ref_list, ref->bmi_addr);
+    tmp_item->addr = addr;

-    /* have to set the method_addr to null before deallocating, since

- * dealloc_ref_st tries to enter the method again to drop theaddr+ /* add to queue of items that we want the BMI control layer toconsider

+     * deallocating
      */
-    ref->method_addr = NULL;
-    dealloc_ref_st(ref);
+    gen_mutex_lock(&forget_list_mutex);
+    qlist_add(&tmp_item->link, &forget_list);
+    gen_mutex_unlock(&forget_list_mutex);
+
     return (0);
 }

@@ -2102,6 +2093,81 @@
     return bmi_errno;
 }

+/* bmi_check_forget_list()
+ *

+ * Scans queue of items that methods have suggested that we forgetabout

+ *
+ * no return value
+ */
+static void bmi_check_forget_list(void)
+{
+    PVFS_BMI_addr_t tmp_addr;
+    struct forget_item* tmp_item;
+    ref_st_p tmp_ref = NULL;
+
+    gen_mutex_lock(&forget_list_mutex);
+    while(!qlist_empty(&forget_list))
+    {
+        tmp_item = qlist_entry(forget_list.next, struct forget_item,
+            link);
+        qlist_del(&tmp_item->link);

+ /* item is off of the list; unlock for a moment while wework on

+         * this addr
+         */
+        gen_mutex_unlock(&forget_list_mutex);
+        tmp_addr = tmp_item->addr;
+        free(tmp_item);
+
+        gen_mutex_lock(&ref_mutex);
+        tmp_ref = ref_list_search_addr(cur_ref_list, tmp_addr);
+        if(tmp_ref && tmp_ref->ref_count == 0)
+        {
+            bmi_addr_drop(tmp_ref);
+        }
+        gen_mutex_unlock(&ref_mutex);
+
+        gen_mutex_lock(&forget_list_mutex);
+    }
+    gen_mutex_unlock(&forget_list_mutex);
+
+    return;
+}
+
+/* bmi_addr_drop
+ *

+ * Destroys a complete BMI address, including asking the method toclean up+ * its portion. Will query the method for permission beforeproceeding

+ *
+ * NOTE: must be called with ref list mutex held
+ */
+static void bmi_addr_drop(ref_st_p tmp_ref)
+{
+    struct method_drop_addr_query query;
+    query.response = 0;
+    query.addr = tmp_ref->method_addr;
+    int ret = 0;
+
+    /* reference count is zero; ask module if it wants us to discard
+     * the address; TCP will tell us to drop addresses for which the
+     * socket has died with no possibility of reconnect
+     */
+    ret = tmp_ref->interface->BMI_meth_get_info(BMI_DROP_ADDR_QUERY,
+        &query);
+    if(ret == 0 && query.response == 1)
+    {
+        /* kill the address */
+        gossip_debug(GOSSIP_BMI_DEBUG_CONTROL,
+            "[BMI CONTROL]: %s: bmi discarding address: %llu\n",
+            __func__, llu(tmp_ref->bmi_addr));
+        ref_list_rem(cur_ref_list, tmp_ref->bmi_addr);
+        /* NOTE: this triggers request to module to free underlying
+         * resources if it wants to
+         */
+        dealloc_ref_st(tmp_ref);
+    }
+    return;
+}
+
 /*
  * Local variables:
  *  c-indent-level: 4

diff -Naur pvfs2-stock/src/io/bmi/bmi_tcp/bmi-tcp.c pvfs2/src/io/bmi/bmi_tcp/bmi-tcp.c--- pvfs2-stock/src/io/bmi/bmi_tcp/bmi-tcp.c 2007-10-1618:53:44.000000000 -0400+++ pvfs2/src/io/bmi/bmi_tcp/bmi-tcp.c 2007-10-1710:51:02.000000000 -0400

@@ -1855,9 +1855,6 @@
            0, &tmp_outcount, &tmp_addr, &tmp_status, 0, &interface_mutex);
     }

-    gen_mutex_unlock(&interface_mutex);
-    bmi_method_addr_forget_callback(tcp_addr_data->bmi_addr);
-    gen_mutex_lock(&interface_mutex);
     tcp_shutdown_addr(map);
     tcp_cleanse_addr(map, error_code);
     tcp_addr_data->addr_error = error_code;
@@ -1865,6 +1862,13 @@
     {
        dealloc_tcp_method_addr(map);
     }
+    else
+    {
+        /* this will cause the bmi control layer to check to see if
+         * this address can be completely forgotten
+         */
+        bmi_method_addr_forget_callback(tcp_addr_data->bmi_addr);
+    }
     return;
 };


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

[Pvfs2-developers] Re: halloween bug (possible patch update)

Reply via email to