Most HA frameworks fail over when they lose leadership. I don't know the exact semantics of SUSPENDED in Curator, but if it means that you should not be taking any leader-related actions, the conservative thing to do is to fail over when it occurs. Ultimately this depends on Curator's leader election semantics.
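
Here is a rough sketch of that conservative policy, in case it helps. It is not Curator's API: `ConnState` is a stand-in enum so the example compiles on its own, and `LeaderGuard`, `onElected` and the `failover` callback are hypothetical names of mine. The idea is simply to treat both SUSPENDED and LOST as a possible loss of leadership, and never to regain leadership just because the connection came back.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for Curator's connection states, reduced to the ones discussed
// here so that the sketch compiles without Curator on the classpath.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical helper: tracks whether this process may act as leader.
final class LeaderGuard {
    private final AtomicBoolean mayLead = new AtomicBoolean(false);
    private final Runnable failover; // assumption: tears down the driver / exits

    LeaderGuard(Runnable failover) {
        this.failover = failover;
    }

    // Call when the election recipe reports that leadership was acquired.
    void onElected() {
        mayLead.set(true);
    }

    // Call from the connection-state callback.
    void onStateChange(ConnState state) {
        switch (state) {
            case SUSPENDED:
            case LOST:
                // Conservative: leadership may be gone, so give it up.
                // compareAndSet makes the failover action run at most once
                // per term of leadership.
                if (mayLead.compareAndSet(true, false)) {
                    failover.run();
                }
                break;
            default:
                // CONNECTED / RECONNECTED do not restore leadership by
                // themselves; wait for a fresh election result.
                break;
        }
    }

    boolean mayLead() {
        return mayLead.get();
    }
}

class LeaderGuardDemo {
    public static void main(String[] args) {
        LeaderGuard guard = new LeaderGuard(
            () -> System.out.println("failing over: stop leader-only work"));
        guard.onElected();
        guard.onStateChange(ConnState.SUSPENDED);   // triggers the failover action
        guard.onStateChange(ConnState.RECONNECTED); // does not restore leadership
        System.out.println("may lead after reconnect: " + guard.mayLead());
    }
}
```

With real Curator you would presumably drive `onStateChange` from a `ConnectionStateListener`, and the `failover` action would be whatever your framework does to give up leadership, for example tearing down the scheduler driver and exiting so that another instance can take over.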
The semantics of SUSPENDED seem a bit strange to me, since its description reads: "There has been a loss of connection. Leaders, locks, etc. should suspend until the connection is re-established. If the connection times-out you will receive a LOST notice". From this description, one must assume that leadership *may* have been lost. Have you reached out to the Curator devs?

You can't stop() the driver and then start() it again, even though the names would lead one to think otherwise.

On Wed, Apr 16, 2014 at 11:23 AM, David Greenberg <[email protected]> wrote:

> Hello Mesos devs,
> I'm trying to integrate leader election into the framework I already wrote.
> The framework uses a shared database, and will be fine as long as at most
> one copy is running at a given time. I am using Curator to provide the
> leader election. What I'm not sure about is how to handle when the curator
> goes into a SUSPENDED state--I'd like to stop the driver, and restart it
> once the connection RECONNECTs, but I'm not sure if I can do that without
> creating a whole new MesosSchedulerDriver. Is this the right way to go
> about it? Should I just ignore SUSPENDED? What do other frameworks do?
>
