[ 
https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738157#action_12738157
 ] 

Jim Kellerman commented on HBASE-1736:
--------------------------------------

Heartbeat traffic and other communication between the Master and the HRS
should be greatly reduced in 0.21, with region assignment happening via ZK,
so this should be somewhat less of an issue by then. However, I agree that
the HRS can continue so long as it does not need to split.

The problem is, what do we do when an HRS reaches the point where it needs to
split?
- Put the cluster into "real-safe-mode" where no requests are accepted from 
clients?
- Make the cluster read-only?

In either case, the cluster would leave that mode once a new master is
available.
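
To make the two options concrete, here is a minimal sketch of a request gate
that falls back to one of them when the master is lost and returns to normal
when a new master shows up. Every name in it (MasterlessModeSketch, Gate, Mode,
Request) is hypothetical and not part of HBase; it only illustrates the
behavior being discussed.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch only; none of these types exist in HBase. */
final class MasterlessModeSketch {
    /** NORMAL: master reachable. READ_ONLY / SAFE: the two proposed fallbacks. */
    enum Mode { NORMAL, READ_ONLY, SAFE }

    enum Request { READ, WRITE }

    static final class Gate {
        private final Mode fallback;
        private final AtomicReference<Mode> mode =
            new AtomicReference<>(Mode.NORMAL);

        /** fallback is the mode to enter when the master is lost. */
        Gate(Mode fallback) {
            if (fallback == Mode.NORMAL) {
                throw new IllegalArgumentException("fallback must restrict requests");
            }
            this.fallback = fallback;
        }

        /** Called when the master can no longer be reached. */
        void onMasterLost() { mode.set(fallback); }

        /** Called when a new master becomes available. */
        void onMasterAvailable() { mode.set(Mode.NORMAL); }

        /** SAFE rejects everything; READ_ONLY rejects only writes. */
        boolean accepts(Request request) {
            switch (mode.get()) {
                case NORMAL:    return true;
                case READ_ONLY: return request == Request.READ;
                default:        return false;
            }
        }
    }
}
```

With new Gate(Mode.READ_ONLY), clients can keep reading during a master
outage; with new Gate(Mode.SAFE), nothing is served until onMasterAvailable()
is called.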

If we do region assignment more at the HRS level and use the master just to
scan for unassigned regions and put them into a pool, the master becomes much
less significant. The HRS could update .META. itself and place the split
children into the unassigned pool.
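
A rough sketch of that hand-off, under heavy assumptions: the catalog and the
unassigned pool are modeled as plain in-memory collections, the class and
method names (SplitHandoffSketch, split, takeUnassigned) are made up for
illustration, and the code is single-threaded. It only shows that the split
can complete without a call to the master.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch only; not HBase code and not thread-safe. */
final class SplitHandoffSketch {
    enum State { ONLINE, SPLIT_OFFLINE, UNASSIGNED }

    // Stand-in for the .META. catalog: region name -> state.
    private final Map<String, State> catalog = new HashMap<>();
    // Stand-in for the pool of regions waiting for a server.
    private final Queue<String> unassignedPool = new ArrayDeque<>();

    void addOnlineRegion(String name) {
        catalog.put(name, State.ONLINE);
    }

    /** Region server side: record the split and queue the children. */
    void split(String parent, String childA, String childB) {
        if (catalog.get(parent) != State.ONLINE) {
            throw new IllegalStateException(parent + " is not online");
        }
        catalog.put(parent, State.SPLIT_OFFLINE);
        catalog.put(childA, State.UNASSIGNED);
        catalog.put(childB, State.UNASSIGNED);
        // No master involved: the children simply wait in the pool.
        unassignedPool.add(childA);
        unassignedPool.add(childB);
    }

    /**
     * Assigner side: take the next waiting region, or null if none.
     * Simplification: a taken region is treated as online immediately.
     */
    String takeUnassigned() {
        String region = unassignedPool.poll();
        if (region != null) {
            catalog.put(region, State.ONLINE);
        }
        return region;
    }

    State stateOf(String name) {
        return catalog.get(name);
    }
}
```

If the master is down when split() runs, the children stay in the pool in the
UNASSIGNED state instead of being lost, and whoever drains the pool later can
deploy them.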


> If RS can't talk to master, pause; more importantly, don't split (Currently 
> we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1736
>                 URL: https://issues.apache.org/jira/browse/HBASE-1736
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.20.1
>
>
> What I saw was master shutting itself down because it had lost zk lease.  
> Fine.   The RS though doesn't look like it can deal with this situation.    
> We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>     at $Proxy0.regionServerReport(Unknown Source)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
>     ... 4 more
> {code}
> ... all over the regionserver as it tries to send heartbeat to master on this 
> broken connection.
> On split, we close the parent and add the children to the catalog, but then 
> when we try to tell the master about the split, it fails. That means the 
> children never get deployed. Meantime the parent is offline.
> This issue is about going through the regionserver and, anywhere it has a 
> connection to master, making sure on fault that no damage is done to the table 
> and that the regionserver then puts a pause on splitting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
