Eitan Zahavi wrote:
Leonid just sent an example for a race that might happen if the SM is to be the maintainer of the data.
The race Leonid mentioned is a client sending a request when the SM is down. That request will fail, so there's no data for the SM to maintain for that node. That's a retry condition that the client must deal with.
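To make the retry condition concrete, here is a minimal sketch of what the client side might look like. Everything in it is hypothetical and for illustration only: `SMUnavailable`, `register_with_retry`, and the `send_request` callable are made-up names, not part of any OpenIB API, and the exponential backoff is just one reasonable policy.

```python
import time


class SMUnavailable(Exception):
    """Hypothetical error: the request got no answer because the SM is down."""


def register_with_retry(send_request, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send_request() until it succeeds or the attempts run out.

    send_request is an assumed callable that raises SMUnavailable when the
    SM does not respond. The wait doubles after each failed attempt.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return send_request()
        except SMUnavailable:
            if attempt == attempts:
                raise  # give up and let the caller decide what to do
            sleep(delay)
            delay *= 2
```

The client keeps no state beyond its own pending request, so a request that fails while the SM is down leaves nothing behind for the SM to maintain.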
[EZ] The SM is a single entity that has to respond to all requests from the entire cluster (even redirection requests). When you require that SM to also provide transaction-safe storage or, even worse, consistency with multiple standby SMs, you worsen the problem. The clients, on their side, only need to maintain their own registrations.
I don't believe that there's any requirement that the SM be a single system. But I do believe that the SM should be able to recover from all SM problems without interrupting any existing communication that is occurring on the fabric. SM failover or failure/restart should be as transparent to the clients (i.e., the non-SM nodes in the fabric) as possible. (Btw, I also believe that the SM should run on top of a real DBMS and support SQL-style queries...)
You don't want to push this problem to every application running in the fabric, so why even push it to every node in the fabric?
- Sean
