If RS can't talk to master, pause; more importantly, don't split (currently we 
do, and splits are lost and the table is wounded)
---------------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-1736
                 URL: https://issues.apache.org/jira/browse/HBASE-1736
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: stack
             Fix For: 0.20.1


What I saw was the master shutting itself down because it had lost its zk lease.
Fine.  The RS, though, doesn't look like it can deal with this situation.  We 
see stuff like this:

{code}
...failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
    at $Proxy0.regionServerReport(Unknown Source)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
    ... 4 more
{code}

... all over the regionserver log as it keeps trying to send its heartbeat to 
the master over this broken connection.

On split, we close the parent and add the children to the catalog, but then the 
attempt to tell the master about the split fails.  That means the children 
never get deployed.  Meanwhile, the parent is offline.
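
To make the failure concrete, here is a toy model of the split steps just 
described.  It is only a sketch: SplitModel, its fields and the child names are 
made up for illustration and are not HBase classes, and the masterReachable 
flag stands in for whether the report to the master succeeds.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the split steps; none of these names are HBase APIs. */
final class SplitModel {
    boolean parentOnline = true;
    final List<String> catalog = new ArrayList<>();   // children written to the catalog
    final List<String> deployed = new ArrayList<>();  // children actually deployed

    /** masterReachable is an assumption standing in for the split report succeeding. */
    void split(boolean masterReachable) {
        parentOnline = false;          // 1. close the parent
        catalog.add("childA");         // 2. add the children to the catalog
        catalog.add("childB");
        if (!masterReachable) {
            return;                    // 3. report fails, so nothing deploys the children
        }
        deployed.addAll(catalog);      // report succeeded, children get deployed
    }

    /** True when neither the parent nor any child is serving the key range. */
    boolean rangeUnserved() {
        return !parentOnline && deployed.isEmpty();
    }
}
```

Calling split(false) on a fresh SplitModel leaves the parent closed and both 
children in the catalog but undeployed, which is the wounded state above.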

This issue is about going through the regionserver and, everywhere it talks to 
the master, making sure that a fault does no damage to the table and that the 
regionserver pauses splitting while it cannot reach the master.
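
One possible shape for the pause, as a rough sketch and not a patch: a small 
gate that the heartbeat path updates and the split path checks before it closes 
the parent.  SplitGate, its method names and the consecutive-failure threshold 
are all assumptions made for this example; nothing here is existing 
HRegionServer code.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper, not part of HBase: remembers whether recent reports
 * to the master succeeded and tells the split path whether it may start.
 */
final class SplitGate {
    private final int maxFailures;  // assumed threshold, chosen by the caller
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    SplitGate(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /** Call after a report to the master succeeds; lifts the pause. */
    void reportSucceeded() {
        consecutiveFailures.set(0);
    }

    /** Call after a report to the master fails, e.g. with a ConnectException. */
    void reportFailed() {
        // Saturate at Integer.MAX_VALUE so the counter cannot wrap around.
        consecutiveFailures.updateAndGet(n -> n == Integer.MAX_VALUE ? n : n + 1);
    }

    /** Splits are paused once maxFailures reports in a row have failed. */
    boolean maySplit() {
        return consecutiveFailures.get() < maxFailures;
    }
}
```

The split code would check maySplit() before closing the parent, so a region 
is never taken offline while the master is known to be unreachable.  This does 
not cover the master dying in the middle of a split; that case still needs the 
fault handling described above.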

