Eitan Zahavi wrote:
[EZ] The race is happening when the SM received the request and
responded but the other SMs or the file system did not fully store that
registration and the SM crashed.

If the client received a response that the join was successful, then I consider that an SM issue.

The problem is that the SM lost *its* state information. Requiring end nodes to maintain the SM's state for it still doesn't make sense to me. You're converting an SM issue into a requirement that all end nodes must support for proper operation.

Why can't the local system store the same data in another process? (E.g. record all join MADs that have been processed by the SM.) Why can't that data be saved to disk? Why can't some other arbitrary system in the fabric save that data?
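
To make the "save that data to disk" option concrete, here is a rough sketch of a join journal. It assumes a JSON-lines file and hypothetical names (JoinJournal, record_join, replay) that are not part of any existing SM implementation. The idea is only that the SM persists each join before it responds, and rebuilds its membership table from the file after a restart.

```python
import json
import os


class JoinJournal:
    """Append-only log of joins the SM has processed (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def record_join(self, mgid, port_gid):
        # Write and fsync the record before the SM sends its response,
        # so a crash after the response cannot lose the registration.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"mgid": mgid, "port_gid": port_gid}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Rebuild the group membership table (mgid -> set of port GIDs).
        members = {}
        if not os.path.exists(self.path):
            return members
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a record torn by a crash mid-write
                members.setdefault(rec["mgid"], set()).add(rec["port_gid"])
        return members
```

A restarted SM (or a standby reading the same file) would call replay() once at startup. Leaves, and replication of the file to another system in the fabric, are left out of this sketch.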

I still believe that there are a lot of potential solutions to this problem other than requiring end nodes to maintain the SM's state.

- Sean
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
