ZK session recovery in the active master was added some time ago, but it requires complex state management to decide which services inside the master to reinitialize and which to keep. We discussed removing it altogether, since it adds a lot of code complexity and makes recovery from a lost ZK session very error prone (I remember 1-2 issues fixing this area).
I think architecturally, we should remove ZK session recovery from the active master and not add it to the backup masters at all. Another service, like Ambari or a supervisor, should be responsible for bringing the master / backup master nodes back.

Enis


On Thu, Mar 20, 2014 at 11:35 AM, Andrew Purtell <[email protected]> wrote:

> Why did the backup master's zookeeper session expire? That indicates a
> problem somewhere on the network or with zookeeper.
>
> The active master and regionservers also shut down when their sessions
> expire. If our zookeeper session expires we have been partitioned and have
> a high degree of uncertainty from our vantage point on the state of the
> world. We shut down to avoid accidentally taking incorrect actions with bad
> or out of date state. This simplifies design and removes corner cases. In
> a production environment I would expect a site local strategy (could be
> daemontools etc.) for automatic service recovery, if that is desired.
>
>
> On Thu, Mar 20, 2014 at 12:43 AM, Du, Jingcheng <[email protected]> wrote:
>
> > Dear Devs,
> >
> > I have encountered a problem in the HMaster.
> > Currently I run multiple HMasters in a cluster. If the ZK connection of
> > one of the backup HMasters expires, this backup HMaster will go down
> > directly without recovering the ZK connection.
> > I saw the following code in HMaster.abortNow(); the fail.fast setting
> > only applies to the active HMaster. Do the backup ones need to be
> > recovered if the ZK connection expires? Please advise. Thanks.
> >
> >     if (!this.isActiveMaster || this.stopped) {
> >       return true;
> >     }
> >     boolean failFast = conf.getBoolean("fail.fast.expired.active.master", false);
> >
> >
> > Regards,
> > Jingcheng
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
