Thanks for tracking this down! I've been out of the office for a few days so I am a little behind on the conversation. I have a comment on the patch, though.

As far as I can tell, the patch calls bmi_method_addr_forget_callback() on pretty much any tcp network error, since the call is made from tcp_forget_addr(), which is a general-purpose function.

This is fine on the server side, but I suspect (haven't been able to confirm yet) that it would cause problems on the client side. Clients resolve a tcp://servername:3334 address into a PVFS_BMI_addr_t once and then hang onto it forever. If a network error occurs, then the state machines do not re-resolve it; they will just retry communication on the same PVFS_BMI_addr_t, which will have been invalidated by the forget_callback() function.
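To make the suspected failure concrete, here is a minimal sketch of a client that resolves once and then retries on the same handle. Everything in it (the sketch_ names, the small validity table, the return codes) is made up for illustration and is not the real BMI or state machine code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PVFS_BMI_addr_t. */
typedef int64_t sketch_addr_t;

#define SKETCH_MAX_ADDRS 8

/* Hypothetical validity table, indexed by handle; handle 0 is never used. */
static bool sketch_valid[SKETCH_MAX_ADDRS];

/* Stand-in for resolving tcp://servername:3334; returns -1 if the table is full. */
static sketch_addr_t sketch_lookup(void)
{
    for (sketch_addr_t h = 1; h < SKETCH_MAX_ADDRS; h++) {
        if (!sketch_valid[h]) {
            sketch_valid[h] = true;
            return h;
        }
    }
    return -1;
}

/* Stand-in for the forget callback: invalidates the handle. */
static void sketch_forget(sketch_addr_t h)
{
    if (h > 0 && h < SKETCH_MAX_ADDRS) {
        sketch_valid[h] = false;
    }
}

/* A send succeeds (0) only while the handle is still valid. */
static int sketch_send(sketch_addr_t h)
{
    assert(h < SKETCH_MAX_ADDRS);
    return (h > 0 && sketch_valid[h]) ? 0 : -1;
}

/* Retries on the SAME handle and never re-resolves, as described above. */
static int sketch_send_with_retries(sketch_addr_t h, int attempts)
{
    for (int i = 0; i < attempts; i++) {
        if (sketch_send(h) == 0) {
            return 0;
        }
    }
    return -1;
}
```

In this model, once sketch_forget() has run on the cached handle, no number of retries can succeed, which is the client-side concern raised here.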

Would it be better to only call bmi_method_addr_forget_callback() on addresses within bmi_tcp that were registered using bmi_method_addr_reg_callback()? I haven't looked yet, but that may require an extra flag somewhere to record which addresses this applies to. That way, the only callers invalidating addresses on errors would be servers, which have anonymous addresses that need to be cleaned out. Clients with long-lived address resolutions would not be affected. It also makes a little more sense from an API point of view if these two functions are companions that are called in the same scenario.
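A rough sketch of the flag idea, under stated assumptions: the struct, the registered_by_method field, and all sketch_ functions are hypothetical stand-ins, not the actual bmi_tcp definitions or the real callback signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a BMI address handle. */
typedef int64_t sketch_bmi_addr_t;

/* Hypothetical per-address tcp state; the real struct has more fields. */
struct sketch_tcp_addr {
    sketch_bmi_addr_t bmi_addr;  /* handle in use for this address */
    bool registered_by_method;   /* assumed extra flag: set only on registration */
};

static int sketch_forget_calls = 0;
static sketch_bmi_addr_t sketch_next_handle = 1;

/* Stand-in for the forget callback; only counts invocations here. */
static void sketch_forget_callback(sketch_bmi_addr_t addr)
{
    (void)addr;
    sketch_forget_calls++;
}

/* Server path: an anonymous address is registered and flagged. */
static void sketch_server_accept(struct sketch_tcp_addr *a)
{
    assert(a != NULL);
    a->bmi_addr = sketch_next_handle++;  /* stand-in for the reg callback */
    a->registered_by_method = true;
}

/* Client path: the address comes from a lookup, so the flag stays clear. */
static void sketch_client_resolve(struct sketch_tcp_addr *a, sketch_bmi_addr_t h)
{
    assert(a != NULL);
    a->bmi_addr = h;
    a->registered_by_method = false;
}

/* General-purpose error path: only flagged addresses are forgotten.
 * Returns true if the forget callback was invoked. */
static bool sketch_tcp_forget_addr(struct sketch_tcp_addr *a)
{
    assert(a != NULL);
    if (!a->registered_by_method) {
        return false;  /* long-lived client address: leave it alone */
    }
    sketch_forget_callback(a->bmi_addr);
    a->registered_by_method = false;
    return true;
}
```

With this shape the register and forget calls are paired: only an address that went through the registration path is ever passed to the forget path.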

Hi Phil,

I can add a server flag to the tcp addr struct, and only call forget_addr in that case, but it seems like a bit of a hack.

Actually, one more follow-up on this specific patch to wrap up. The problem that I suspected does not actually occur, although that may be by luck :) There is a tcp_addr_data->bmi_addr field that gets used as the argument to the forget_callback() function. That bmi_addr field is set to zero unless it is filled in by the addr_reg_callback() function. That means that on the client side, the forget_callback() function just fails, because it searches for a BMI_addr_t of value 0 in the reference list.

I just tested it out, and pvfs2-client-core was able to recover from a network error just fine with the patch in place.
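For illustration, the zero-handle behavior described above can be modeled like this. It is only a sketch: the reference list layout, the sketch_ names, and the return codes are assumptions, and it assumes (as the observed behavior suggests) that no live entry ever carries the value 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BMI_addr_t. */
typedef int64_t sketch_bmi_addr_t;

/* Hypothetical reference list entry. */
struct sketch_ref {
    sketch_bmi_addr_t bmi_addr;  /* assumed to be nonzero for live entries */
    int valid;                   /* 1 while the address may be used */
};

/* Invalidates the entry matching addr.
 * Returns 0 on success, -1 if no matching live entry exists. */
static int sketch_forget_callback(struct sketch_ref *list, size_t n,
                                  sketch_bmi_addr_t addr)
{
    assert(list != NULL);
    for (size_t i = 0; i < n; i++) {
        if (list[i].valid && list[i].bmi_addr == addr) {
            list[i].valid = 0;
            return 0;
        }
    }
    return -1;  /* e.g. addr == 0 on the client side: nothing to forget */
}
```

A client-side tcp address that never went through registration passes 0, the search finds nothing, and every existing entry stays valid.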

-Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
