[ https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738162#action_12738162 ]
stack commented on HBASE-1736:
------------------------------

So, I notice that the RS has a watcher on master. We got this:

{code}
2009-08-01 19:29:38,018 [main-EventThread] INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDeleted, path: /hbase/master
{code}

... but all we do is reset the watcher:

{code}
} else if (type == EventType.NodeDeleted) {
  watchMasterAddress();
{code}

We should set a flag that stops splitting -- take out the CompactSplitThread#lock -- until we get NodeCreated (NodeCreated does getMaster() ... could release the lock too...). Holding the lock would hold up the CompactSplitThread ... it does compactions too ... probably not what's wanted.

> If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1736
>                 URL: https://issues.apache.org/jira/browse/HBASE-1736
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.20.1
>
>
> What I saw was master shutting itself down because it had lost zk lease. Fine. The RS though doesn't look like it can deal with this situation.
> We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection refused
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>         at $Proxy0.regionServerReport(Unknown Source)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
>         at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
>         ... 4 more
> {code}
> ... all over the regionserver as it tries to send heartbeat to master on this broken connection.
> On split, we close the parent and add the children to the catalog, but then when we try to tell the master about the split, it fails. That means the children never get deployed. Meantime the parent is offline.
> This issue is about going through the regionserver and, anywhere it has a connection to master, making sure on fault that no damage is done to the table and then that the regionserver puts a pause on splitting.
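
A rough sketch of the "flag that stops splitting" idea from the comment above, kept separate from CompactSplitThread#lock so that compactions are not held up. Everything named here (SplitGate, MasterNodeEvent, maybeSplit) is hypothetical and is not taken from the HBase code base; the enum only stands in for the ZooKeeper NodeDeleted / NodeCreated events on /hbase/master, and the real watcher wiring is left out.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the ZooKeeper event types seen on /hbase/master. */
enum MasterNodeEvent { NODE_CREATED, NODE_DELETED }

/**
 * Hypothetical helper: remembers whether the master znode is present and
 * refuses to start a split while it is gone. Compactions never consult it,
 * so they keep running.
 */
final class SplitGate {
    // Assumption: the master is reachable when the regionserver starts.
    private final AtomicBoolean masterAvailable = new AtomicBoolean(true);

    /** Called from the watcher callback after the watch has been reset. */
    void onMasterNodeEvent(MasterNodeEvent event) {
        switch (event) {
            case NODE_DELETED:
                masterAvailable.set(false); // pause splitting
                break;
            case NODE_CREATED:
                masterAvailable.set(true);  // master is back, resume splitting
                break;
        }
    }

    boolean splitsAllowed() {
        return masterAvailable.get();
    }

    /**
     * Runs the split only if the master znode is present.
     * Returns true if the split was started, false if it was skipped.
     * The check and the split are not atomic: the master can still vanish
     * mid-split, so this narrows the window rather than closing it.
     */
    boolean maybeSplit(Runnable split) {
        if (!splitsAllowed()) {
            return false;
        }
        split.run();
        return true;
    }
}
```

The split path would call maybeSplit(...) instead of splitting unconditionally, and a skipped split would simply be retried on a later pass once NodeCreated has been seen.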