I agree that resuming the process is best handled by site-local tooling. Perhaps we could do a better job of informing that tooling about the nature of the failure. Well-defined exit codes, for instance, may be useful.
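
To make the exit-code idea concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: the MasterExitCode enum, its constants, and the numeric values are hypothetical placeholders and do not exist in HBase today. The point is only that each failure mode gets a distinct, documented status, plus a hint about whether an automatic restart makes sense.

```java
/**
 * Hypothetical exit statuses for a master process (not part of HBase).
 * The numeric values are arbitrary placeholders chosen for this sketch.
 */
enum MasterExitCode {
  CLEAN_SHUTDOWN(0, false),      // operator asked us to stop; leave it stopped
  ZK_SESSION_EXPIRED(64, true),  // partitioned from ZooKeeper; safe to start fresh
  CONFIG_ERROR(65, false),       // restarting will not help until config is fixed
  UNEXPECTED_ABORT(70, true);    // anything else we aborted on

  private final int code;
  private final boolean restartable;

  MasterExitCode(int code, boolean restartable) {
    this.code = code;
    this.restartable = restartable;
  }

  /** Status the process would pass to System.exit(). */
  int code() {
    return code;
  }

  /** Whether a supervisor may reasonably restart the process. */
  boolean restartable() {
    return restartable;
  }

  /** Maps a raw exit status back to a known code; unknown statuses count as aborts. */
  static MasterExitCode fromStatus(int status) {
    for (MasterExitCode c : values()) {
      if (c.code == status) {
        return c;
      }
    }
    return UNEXPECTED_ABORT;
  }

  /** Convenience check for supervisor-side logic. */
  static boolean shouldRestart(int status) {
    return fromStatus(status).restartable();
  }
}
```

With something like this, the master would exit via System.exit(MasterExitCode.ZK_SESSION_EXPIRED.code()) when its session expires, and the site-local supervisor could restart on that status while leaving a CONFIG_ERROR exit for an operator to look at.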

On Thursday, March 20, 2014, Du, Jingcheng <[email protected]> wrote:

> Thanks a lot for the comments.
>
> I think we could have another service or supervisor to bring the backup
> masters back when they go down.
>
> Regards,
> Jingcheng
>
> -----Original Message-----
> From: ramkrishna vasudevan [mailto:[email protected]]
> Sent: Friday, March 21, 2014 12:02 PM
> To: [email protected]
> Subject: Re: Backup HMasters will go down if the zk connection expires
> without recovery
>
> We discussed this internally too. May be the intention was to see if
> through code it can be handled. Generally the management of these back up
> master can be done outside of HBase through monitoring services.
> @Jingcheng
> What do you think?
>
> Regards
> Ram
>
>
> On Fri, Mar 21, 2014 at 3:25 AM, Enis Söztutar <[email protected]> wrote:
>
> > Zk session recovery in the active master was added some time ago, but
> > it requires a complex state management in regards to what services
> > inside master to reinitialize or keep. We discussed that we should
> > remove it altogether since this increases the code complexity by a
> > lot, and makes the recovery from zk session lost very error prone (a
> > remember 1-2 issues fixing this area).
> >
> > I think architecturally, we remove zk session recovery from active
> > master, and not add this to backup masters at all. Another service,
> > like Ambari, or a supervisor should be responsible to bring the master
> > / backup master nodes back.
> >
> > Enis
> >
> >
> > On Thu, Mar 20, 2014 at 11:35 AM, Andrew Purtell <[email protected]> wrote:
> >
> > > Why did the backup master's zookeeper session expire? That indicates
> > > a problem somewhere on the network or with zookeeper.
> > >
> > > The active master and regionservers also shut down when their
> > > sessions expire. If our zookeeper session expires we have been
> > > partitioned and have a high degree of uncertainty from our vantage
> > > point on the state of the world. We shut down to avoid accidentally
> > > taking incorrect actions with bad or out of date state. This
> > > simplifies design and removes corner cases. In a production
> > > environment I would expect a site local strategy (could be
> > > daemontools etc.) for automatic service recovery, if that is desired.
> > >
> > >
> > > On Thu, Mar 20, 2014 at 12:43 AM, Du, Jingcheng <[email protected]> wrote:
> > >
> > > > Dear Devs,
> > > >
> > > > Now I encounter a problem in the HMaster.
> > > > Currently I run multiple HMasters in a cluster. If the ZK
> > > > connection of one of the backup HMasters expires, this backup
> > > > HMaster will go down directly without recovering the ZK connection.
> > > > I saw there were such code in the HMaster.abortNow() listed below,
> > > > the fail.fast only works for active HMaster. Do the backup ones
> > > > need to be recovered if the zk connection expires? Please advise.
> > > > Thanks.
> > > >
> > > >     if (!this.isActiveMaster || this.stopped) {
> > > >       return true;
> > > >     }
> > > >     boolean failFast =
> > > >         conf.getBoolean("fail.fast.expired.active.master", false);
> > > >
> > > >
> > > > Regards,
> > > > Jingcheng
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> > > Hein (via Tom White)
