Re: [Pvfs2-developers] Re: the halloween bug fixed

Phil Carns Thu, 11 Oct 2007 08:16:52 -0700

Sam Lang wrote:

The attached patch is the proposed fix for this problem. When the tcp method receives a disconnect from a peer, it invokes a callback (bmi_method_addr_forget_callback) into the bmi control layer to remove the address reference from the list. Maybe I should also add a counter and limit on how big the list can get, although that would involve potentially forcing long-lived connections to reconnect periodically, and all methods would have to implement BMI_set_info (DROP_ADDR).
With tcp, new connections are registered, even if they are from the same host/port on the peer, whereas the other methods seem to only register new host/port endpoints that haven't been seen before. So it's not completely clear to me when the other methods need to call this callback, if at all. There needs to be a matching bmi_method_addr_forget_callback for each bmi_method_addr_reg_callback, but if the method only registers a single address per client, the list won't keep growing, unless we ever plan to support millions of clients.


Hi Sam,

Thanks for tracking this down! I've been out of the office for a few days so I am a little behind on the conversation. I have a comment on the patch, though.

I think that for the most part, it is calling bmi_method_addr_forget_callback() on pretty much any tcp network error, since it is being invoked from the tcp_forget_addr() function, which is a general purpose function.

This is fine on the server side, but I suspect (haven't been able to confirm yet) that it would cause problems on the client side. Clients resolve a tcp://servername:3334 address into a PVFS_BMI_addr_t once and then hang onto it forever. If a network error occurs, then the state machines do not re-resolve it; they will just retry communication on the same PVFS_BMI_addr_t, which will have been invalidated by the forget_callback() function.
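
A minimal sketch of the suspected client-side failure, assuming the control layer keeps a table that maps handles to addresses. Every name here (sketch_lookup, sketch_forget, sketch_send, the table itself) is hypothetical and stands in for the real BMI code; it only illustrates why a handle that was resolved once stops working after a forget, and it does not confirm that the real client behaves this way.

```c
#include <stddef.h>

#define SKETCH_MAX_ADDRS 8

/* Hypothetical stand-in for PVFS_BMI_addr_t: an opaque handle. */
typedef long sketch_addr_t;

/* Hypothetical control-layer table: slot i holds the address string
 * for handle i + 1, or NULL if the slot is free or was forgotten. */
static const char *sketch_table[SKETCH_MAX_ADDRS];

/* Resolve an address string into a handle; returns 0 if the table is full. */
static sketch_addr_t sketch_lookup(const char *uri)
{
    for (size_t i = 0; i < SKETCH_MAX_ADDRS; i++) {
        if (sketch_table[i] == NULL) {
            sketch_table[i] = uri;
            return (sketch_addr_t)(i + 1);
        }
    }
    return 0;
}

/* Drop the reference, as a forget callback would on a network error. */
static void sketch_forget(sketch_addr_t handle)
{
    if (handle >= 1 && handle <= SKETCH_MAX_ADDRS)
        sketch_table[handle - 1] = NULL;
}

/* Pretend to send: returns 0 if the handle is still valid, -1 if stale. */
static int sketch_send(sketch_addr_t handle)
{
    if (handle < 1 || handle > SKETCH_MAX_ADDRS)
        return -1;
    return sketch_table[handle - 1] != NULL ? 0 : -1;
}
```

In this model, a client that calls sketch_lookup("tcp://servername:3334") once and keeps the handle gets -1 from sketch_send() on every retry after sketch_forget() has run, because nothing resolves the address again.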

Would it be better to only call bmi_method_addr_forget_callback() on addresses within bmi_tcp that were registered using bmi_method_addr_reg_callback()? I haven't looked yet, but that may require an extra flag somewhere to record which addresses this applies to. That way, the only person invalidating these things on errors will be servers which have anonymous addresses that need to be cleaned out. Clients with long lived address resolutions would not be affected. It also makes a little more sense from an API point of view if these two functions are companions that are called in the same scenario.
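
The flag idea could look roughly like the sketch below. This is an illustration under assumptions, not the bmi_tcp code: the struct, its fields, and the sketch_* functions are all hypothetical, and the known_to_control field merely stands in for whatever the real register and forget callbacks do to the control layer's list.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a bmi_tcp address record. */
struct tcp_addr_sketch {
    int  socket_fd;            /* -1 when there is no open connection */
    bool registered_by_method; /* the proposed extra flag */
    bool known_to_control;     /* models presence in the control layer's list */
};

/* Server accept path: the method registers an anonymous address itself,
 * as it would through bmi_method_addr_reg_callback(), and sets the flag. */
static void sketch_accept(struct tcp_addr_sketch *addr, int fd)
{
    addr->socket_fd = fd;
    addr->registered_by_method = true;
    addr->known_to_control = true;
}

/* Client resolve path: the address came from an explicit lookup of a
 * tcp://host:port string, so the flag stays clear. */
static void sketch_resolve(struct tcp_addr_sketch *addr)
{
    addr->socket_fd = -1;
    addr->registered_by_method = false;
    addr->known_to_control = true;
}

/* General-purpose error path: always close the connection, but only
 * drop the control-layer reference for addresses the method registered. */
static void sketch_forget_addr(struct tcp_addr_sketch *addr)
{
    addr->socket_fd = -1;
    if (addr->registered_by_method) {
        /* stands in for bmi_method_addr_forget_callback() */
        addr->known_to_control = false;
        addr->registered_by_method = false;
    }
}
```

With this split, an error on a server-side anonymous address removes it from the list, while a client-resolved address survives the same error path and can be retried on the same handle.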

By the way, I'm glad you are doing some work on hash tables for these things - these linked lists have always bugged me :)


-Phil

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
