Why did the backup master's ZooKeeper session expire? That indicates a problem somewhere on the network or with ZooKeeper.
The active master and regionservers also shut down when their sessions expire. If our ZooKeeper session expires, we have been partitioned and, from our vantage point, have a high degree of uncertainty about the state of the world. We shut down to avoid accidentally taking incorrect actions based on bad or out-of-date state. This simplifies the design and removes corner cases.

In a production environment I would expect a site-local strategy (daemontools, for example) for automatic service recovery, if that is desired.

On Thu, Mar 20, 2014 at 12:43 AM, Du, Jingcheng <[email protected]> wrote:

> Dear Devs,
>
> Now I encounter a problem in the HMaster.
> Currently I run multiple HMasters in a cluster. If the ZK connection of
> one of the backup HMasters expires, this backup HMaster will go down
> directly without recovering the ZK connection.
> I saw there were such code in the HMaster.abortNow() listed below, the
> fail.fast only works for active HMaster. Do the backup ones need to be
> recovered if the zk connection expires? Please advise. Thanks.
>
>     if (!this.isActiveMaster || this.stopped) {
>       return true;
>     }
>     boolean failFast = conf.getBoolean("fail.fast.expired.active.master",
>         false);
>
>
> Regards,
> Jingcheng
>

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
