[jira] Updated: (HBASE-1736) If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)

stack (JIRA) Tue, 09 Nov 2010 13:40:50 -0800

     [ 
https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-1736:
-------------------------

    Fix Version/s:     (was: 0.90.0)

Moving out of 0.90.0.  If master is gone when we go to split, we'll just keep 
trying to send the split message.   For ever.  If new splits meantime, should 
get added to messages to pass the master (TODO: verify).  When new master comes 
up, he'll get the split message and update internal state.  New master will 
have visited zk and .META. to establish internal state as part of joining 
cluster.  The messages should align with what master found in .META. and zk.

All above is speculation.  Needs a test but  doesn't need to happen for 
0.90.0RC.

> If RS can't talk to master, pause; more importantly, don't split (Currently 
> we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1736
>                 URL: https://issues.apache.org/jira/browse/HBASE-1736
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>
> What I saw was master shutting itself down because it had lost zk lease.  
> Fine.   The RS though doesn't look like it can deal with this situation.    
> We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection 
> refused
>     at 
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>     at $Proxy0.regionServerReport(Unknown Source)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>     at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
>     at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
>     ... 4 more
> {code}
> ... all over the regionserver as it tries to send heartbeat to master on this 
> broken connection.
> On split, we close parent, add children to the catalog but then when we try 
> to tell the master about the split, it fails.  Means the children never get 
> deployed.  Meantime  the parent is offline.
> This issue is about going through the regionserver and anytime it has a 
> connection to master, make sure on fault that no damage is done the table and 
> then that the regionserver puts a pause on splitting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-1736) If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)

Reply via email to