[jira] Commented: (HBASE-1736) If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)

stack (JIRA) Sun, 02 Aug 2009 17:47:38 -0700

    [ https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738162#action_12738162 ]


stack commented on HBASE-1736:
------------------------------

So, I notice that RS has a watcher on master.  We got this:

{code}
2009-08-01 19:29:38,018 [main-EventThread] INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDeleted, path: /hbase/master
{code}
 
... but all we do is reset the watcher:

{code}
    } else if (type == EventType.NodeDeleted) {
      watchMasterAddress();
{code}

We should set a flag that stops splitting -- e.g. take out the 
CompactSplitThread#lock -- until we get NodeCreated (the NodeCreated handler 
calls getMaster() ... it could release the lock too...).  Holding the lock 
would hold up the whole CompactSplitThread, though... it does compactions 
too... so that is probably not what's wanted.
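
A minimal sketch of the flag idea, assuming a small helper that the watcher callback and the split code would share. SplitGate, MasterEvent, onMasterEvent and trySplit are hypothetical names made up for illustration, not existing HBase classes or methods, and the enum is only a stand-in for the ZooKeeper NodeCreated/NodeDeleted event types so that the example runs without a ZooKeeper dependency:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper, not part of HBase: gates splits on master availability. */
class SplitGate {
  /** Stand-in for the ZooKeeper NodeCreated / NodeDeleted events on /hbase/master. */
  enum MasterEvent { NODE_CREATED, NODE_DELETED }

  // True while the master znode is believed to exist. Assumed to start out true.
  private final AtomicBoolean masterAvailable = new AtomicBoolean(true);

  /** Would be called from the watcher callback, next to re-arming the watch. */
  void onMasterEvent(MasterEvent event) {
    switch (event) {
      case NODE_DELETED:
        masterAvailable.set(false); // master gone: stop starting new splits
        break;
      case NODE_CREATED:
        masterAvailable.set(true);  // master back: splits may resume
        break;
    }
  }

  boolean splitsAllowed() {
    return masterAvailable.get();
  }

  /**
   * Runs the split only if the gate is open and reports whether it ran.
   * Compactions would not go through this gate, so they keep running.
   */
  boolean trySplit(Runnable split) {
    if (!splitsAllowed()) {
      return false; // skip for now; the caller can retry after NODE_CREATED
    }
    split.run();
    return true;
  }

  public static void main(String[] args) {
    SplitGate gate = new SplitGate();
    gate.onMasterEvent(MasterEvent.NODE_DELETED);
    System.out.println("split ran: " + gate.trySplit(() -> { }));
    gate.onMasterEvent(MasterEvent.NODE_CREATED);
    System.out.println("split ran: " + gate.trySplit(() -> { }));
  }
}
```

Unlike holding the lock, a flag like this only skips splits, so compactions are not held up.  It is check-then-act, though: a split that has already passed the check when the master goes away would still run, so the reporting path would need its own handling.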

> If RS can't talk to master, pause; more importantly, don't split (Currently 
> we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1736
>                 URL: https://issues.apache.org/jira/browse/HBASE-1736
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.20.1
>
>
> What I saw was the master shutting itself down because it had lost its zk 
> lease.  Fine.  The RS, though, doesn't look like it can deal with this 
> situation.  We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>     at $Proxy0.regionServerReport(Unknown Source)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
>     ... 4 more
> {code}
> ... all over the regionserver as it tries to send its heartbeat to the 
> master on this broken connection.
> On split, we close the parent and add the children to the catalog, but then 
> when we try to tell the master about the split, it fails.  That means the 
> children never get deployed.  Meantime the parent is offline.
> This issue is about going through the regionserver and, anywhere it has a 
> connection to the master, making sure that on fault no damage is done to the 
> table and that the regionserver then puts a pause on splitting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
