Could you clarify what you mean by 'network partition'? At first blush... not a good idea.
Can you please explain how a laptop suspend is an issue for those running HBase in a production environment? I really want to encourage the committers to think more like product owners, because HBase is now being presented to the enterprise by Cloudera and Amazon. (MapR has M7, and I haven't seen either IBM or Pivotal in the wild.)

On Jun 4, 2014, at 10:30 AM, Cosmin Lehene wrote:

> Thanks Andy,
>
> This is in line with what I was asking/thinking of.
>
> To separate concerns a bit:
> The question is whether the HRegionServer, instead of exiting, could have a
> standby state (maintaining the current region state would be optional,
> depending on whether we have shadow regions or not) where it doesn't serve
> requests but is still visible to the cluster. This way we could "wake it up"
> if we decide that it is safe (e.g. we remain consistent).
>
> This scenario seems to solve both my laptop standby problem and
> (potentially fast) recovery after a network partition.
>
> Cosmin
>
>
> On 6/3/14, 3:42 PM, "Andrew Purtell" wrote:
>
>> Although a dev laptop suspend and resume is equivalent to a total network
>> failure for the elapsed time, with all that implies, perhaps after
>> HBASE-10070 goes in it could be possible for a RS to stay up with regions
>> switched to read-only replica mode and reconnect; the master, upon
>> discovering all regions still available on the cluster but in read-only
>> mode (no primary), could then reassign primaries, the net effect being
>> that indeed no process needs restarting by a supervisor.
>>
>>
>> On Tue, Jun 3, 2014 at 3:27 PM, Andrew Purtell wrote:
>>
>>> Stating the obvious: you just need to restart the RegionServer because it
>>> shut down. We use ZooKeeper for tracking server liveness, and from the
>>> ZooKeeper perspective a sufficiently long time elapsed without a heartbeat
>>> that the RegionServer's session expired. To date we've left this option
>>> to the user to handle with supervisory scripts, e.g. Puppet / Chef /
>>> Daemontools. I suppose a RegionServer could try to reinitialize as a new
>>> process, or the ./bin/hbase script could do this if you ask.
>>>
>>>
>>> On Tue, Jun 3, 2014 at 2:46 PM, Cosmin Lehene wrote:
>>>
>>>> I just realized that, for years, I've restarted HBase countless times,
>>>> every time my laptop comes out of standby.
>>>> I know well why I do this, but I also know I could probably avoid it,
>>>> that I don't have to do it with Hadoop or ZooKeeper or other services,
>>>> and I wish I didn't need to with HBase either.
>>>>
>>>> So short term, I'd like to know if there is a better way already.
>>>>
>>>> Long term, I think this is a bigger, more fundamental resiliency aspect
>>>> that perhaps is not trivial, but probably worth thinking about in the
>>>> context of real deployments, and I wonder if there's something that
>>>> tries to solve this already.
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks,
>>>> Cosmin
>>>
>>>
>>> --
>>> Best regards,
>>>
>>>    - Andy
>>>
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)

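The current workaround Andy mentions, a supervisory script that relaunches the RegionServer after it shuts down, amounts to a restart loop. Below is a simplified, hypothetical Python sketch of such a loop, not the behavior of Daemontools, Puppet, or Chef: `run_once` stands in for "start the RegionServer and wait for it to exit", and treating exit code 0 as a deliberate stop and capping restarts at `max_restarts` are assumptions of this sketch.

```python
import time
from typing import Callable


def supervise(run_once: Callable[[], int], max_restarts: int, backoff_s: float = 0.0) -> int:
    """Relaunch a process each time it exits abnormally.

    run_once is a hypothetical stand-in for "start the RegionServer and wait
    for it to exit"; it returns the process exit code. Returns the number of
    restarts performed. Treating exit code 0 as a deliberate stop, and capping
    restarts at max_restarts, are assumptions of this sketch.
    """
    restarts = 0
    while True:
        exit_code = run_once()
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(backoff_s)  # optional pause before relaunching


# Usage with a fake process that aborts twice (e.g. after two session
# expiries) and then shuts down cleanly.
exit_codes = iter([1, 1, 0])
print(supervise(lambda: next(exit_codes), max_restarts=5))  # prints 2
```

Note that such a loop only automates the restart; it does not change what the session expiry has already set in motion on the cluster side, which is why the thread looks for a standby state instead.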