Re: [Pvfs2-developers] Re: the halloween bug fixed

Sam Lang Thu, 11 Oct 2007 09:04:31 -0700


On Oct 11, 2007, at 10:15 AM, Phil Carns wrote:

Sam Lang wrote:
The attached patch is the proposed fix for this problem. When the tcp method receives a disconnect from a peer, it invokes a callback (bmi_method_addr_forget_callback) into the bmi control layer to remove the address reference from the list. Maybe I should also add a counter and limit on how big the list can get, although that would involve potentially forcing long-lived connections to reconnect periodically, and all methods would have to implement BMI_set_info (DROP_ADDR).
With tcp, new connections are registered, even if they are from the same host/port on the peer, whereas the other methods seem to only register new host/port endpoints that haven't been seen before. So it's not completely clear to me when the other methods need to call this callback, if at all. There needs to be a matching bmi_method_addr_forget_callback for each bmi_method_addr_reg_callback, but if the method only registers a single address per client, the list won't keep growing, unless we ever plan to support millions of clients.
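To illustrate the pairing described above, here is a minimal sketch of a reference list where every registration is matched by a forget, so the list stays bounded across connect/disconnect cycles. All names here (ref_sketch, ref_list_sketch, reg_sketch, forget_sketch) are hypothetical stand-ins, not the actual bmi control layer structures or the real bmi_method_addr_reg_callback / bmi_method_addr_forget_callback signatures.

```c
#include <stdlib.h>

/* Hypothetical reference entry: maps an id to a method-level address. */
struct ref_sketch {
    unsigned long id;
    void *method_addr;
    struct ref_sketch *next;
};

/* Hypothetical reference list kept by the control layer. */
struct ref_list_sketch {
    struct ref_sketch *head;
    unsigned long next_id;
    size_t count;
};

/* Stand-in for the registration callback: adds one entry.
 * Returns the new id, or 0 if allocation fails. */
static unsigned long reg_sketch(struct ref_list_sketch *list, void *method_addr)
{
    struct ref_sketch *ref = malloc(sizeof *ref);
    if (ref == NULL)
        return 0;
    ref->id = ++list->next_id;
    ref->method_addr = method_addr;
    ref->next = list->head;
    list->head = ref;
    list->count++;
    return ref->id;
}

/* Stand-in for the forget callback: removes the entry with the given id.
 * Returns 1 if an entry was removed, 0 if the id was not found. */
static int forget_sketch(struct ref_list_sketch *list, unsigned long id)
{
    struct ref_sketch **link = &list->head;
    while (*link != NULL) {
        if ((*link)->id == id) {
            struct ref_sketch *dead = *link;
            *link = dead->next;
            free(dead);
            list->count--;
            return 1;
        }
        link = &(*link)->next;
    }
    return 0;
}
```

If a method calls the registration side once per incoming connection (as tcp is described as doing) and the forget side once per disconnect, the count returns to zero after each cycle. A method that registers only once per host/port endpoint would grow the list only with the number of distinct clients.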
Hi Sam,
Thanks for tracking this down! I've been out of the office for a few days so I am a little behind on the conversation. I have a comment on the patch, though.
I think that for the most part, it is calling bmi_method_addr_forget_callback() on pretty much any tcp network error, since it is being invoked from the tcp_forget_addr() function which is a general purpose function.
This is fine on the server side, but I suspect (haven't been able to confirm yet) that it would cause problems on the client side. Clients resolve a tcp://servername:3334 address into a PVFS_BMI_addr_t once and then hang onto it forever. If a network error occurs, then the state machines do not re-resolve it; they will just retry communication on the same PVFS_BMI_addr_t, which will have been invalidated by the forget_callback() function.
Would it be better to only call bmi_method_addr_forget_callback() on addresses within bmi_tcp that were registered using bmi_method_addr_reg_callback()? I haven't looked yet, but that may require an extra flag somewhere to record which addresses this applies to. That way, the only person invalidating these things on errors will be servers which have anonymous addresses that need to be cleaned out. Clients with long lived address resolutions would not be affected. It also makes a little more sense from an API point of view if these two functions are companions that are called in the same scenario.
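The flag idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: tcp_addr_sketch and every function below are hypothetical stand-ins, not the real bmi_tcp address struct, tcp_forget_addr(), or the real callback signatures.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a bmi_tcp method address. */
struct tcp_addr_sketch {
    int  socket_fd;          /* -1 when not connected */
    bool server_registered;  /* true only if created via the registration path */
    bool forgotten;          /* true once the control layer dropped its reference */
};

/* Counts how often the stand-in forget callback ran. */
static int forget_calls_sketch = 0;

/* Stand-in for the forget callback into the control layer. */
static void forget_callback_sketch(struct tcp_addr_sketch *addr)
{
    addr->forgotten = true;
    forget_calls_sketch++;
}

/* Server side: an anonymous address created for an incoming connection. */
static void register_incoming_sketch(struct tcp_addr_sketch *addr, int fd)
{
    addr->socket_fd = fd;
    addr->server_registered = true;
    addr->forgotten = false;
}

/* Client side: an address resolved once from a URL and kept for good. */
static void resolve_outgoing_sketch(struct tcp_addr_sketch *addr)
{
    addr->socket_fd = -1;
    addr->server_registered = false;
    addr->forgotten = false;
}

/* General-purpose error path: always tears down the socket, but only
 * invalidates the reference for addresses that were registered. */
static void on_network_error_sketch(struct tcp_addr_sketch *addr)
{
    addr->socket_fd = -1;
    if (addr->server_registered)
        forget_callback_sketch(addr);
}
```

With this shape, a client-side address survives a network error and can be retried on the same handle, while a server-side anonymous address is cleaned out.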

Hi Phil,

I can add a server flag to the tcp addr struct, and only call forget_addr in that case, but it seems like a bit of a hack. Can we just toss the ref list in the bmi control layer and force methods to manage their addresses? The ref list is only being used to map an opaque address pointer to a particular method. Could we just make the address pointer not opaque and encode the method in there somehow?
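One way the "encode the method in there" idea could look, purely as a sketch: pack a method index into the top bits of a 64-bit handle and leave the rest for a method-local identifier. The type addr_handle_sketch, the 8/56 bit split, and the helper names are all assumptions for illustration, not the actual PVFS_BMI_addr_t layout.

```c
#include <stdint.h>

/* Hypothetical non-opaque address handle: top 8 bits select the method,
 * low 56 bits are an id that only the method itself interprets. */
typedef uint64_t addr_handle_sketch;

#define METHOD_BITS_SKETCH 8
#define ID_BITS_SKETCH     (64 - METHOD_BITS_SKETCH)
#define ID_MASK_SKETCH     ((UINT64_C(1) << ID_BITS_SKETCH) - 1)

/* Build a handle from a method index and a method-local id.
 * Bits of local_id above the low 56 are discarded. */
static addr_handle_sketch encode_addr_sketch(uint8_t method_index,
                                             uint64_t local_id)
{
    return ((uint64_t)method_index << ID_BITS_SKETCH)
           | (local_id & ID_MASK_SKETCH);
}

/* Recover the method index without consulting any lookup list. */
static uint8_t addr_method_sketch(addr_handle_sketch handle)
{
    return (uint8_t)(handle >> ID_BITS_SKETCH);
}

/* Recover the method-local id. */
static uint64_t addr_local_id_sketch(addr_handle_sketch handle)
{
    return handle & ID_MASK_SKETCH;
}
```

The control layer could then dispatch on addr_method_sketch(handle) directly, and each method would own the lifetime of whatever its local id refers to.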


-sam

By the way, I'm glad you are doing some work on hash tables for these things. These linked lists have always bugged me :)
-Phil


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
