[ 
https://issues.apache.org/jira/browse/HBASE-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rajeshbabu updated HBASE-9593:
------------------------------

    Description: 
In some of our tests we found that a region server is always shown as online 
in the master UI even though it is actually dead.
If a region server goes down between the following two steps, it stays in the 
master's online servers list indefinitely:
1) register to master
2) create ephemeral znode

Since the ephemeral znode was never created, there is no notification from 
ZooKeeper, so the master never removes the expired server from the online 
servers list.
Assignments fail whenever this RS is selected as the destination server.
In some cases ROOT or META is not assigned either: whenever this RS is randomly 
selected, the assignment has to wait for the timeout.
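
One way to picture the gap is a master-side audit that expires any server which registered but whose ephemeral znode was never seen within a grace period. The sketch below is only an illustration under that assumption. It is not HBase code and not necessarily the fix for this issue, and the class name, method names, and grace period are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names exist in HBase. */
final class OnlineServerAudit {
    // serverName -> time (ms) at which the master accepted the registration
    private final Map<String, Long> registeredAt = new HashMap<>();
    // servers whose ephemeral znode the master has observed
    private final Set<String> znodeSeen = new HashSet<>();
    // assumed grace period between registration and znode creation
    private final long graceMillis;

    OnlineServerAudit(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Step 1 of the report: the region server registers with the master. */
    void register(String serverName, long nowMillis) {
        registeredAt.put(serverName, nowMillis);
    }

    /** Step 2 of the report: the master observes the server's ephemeral znode. */
    void znodeCreated(String serverName) {
        if (registeredAt.containsKey(serverName)) {
            znodeSeen.add(serverName);
        }
    }

    /** Removes and returns servers that registered but never created a znode in time. */
    List<String> expireUnconfirmed(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : registeredAt.entrySet()) {
            boolean confirmed = znodeSeen.contains(entry.getKey());
            if (!confirmed && nowMillis - entry.getValue() > graceMillis) {
                expired.add(entry.getKey());
            }
        }
        registeredAt.keySet().removeAll(expired);
        return expired;
    }

    boolean isOnline(String serverName) {
        return registeredAt.containsKey(serverName);
    }
}
```

With such an audit, a server that dies between the two steps would drop out of the online list once the grace period passes, instead of waiting for a ZooKeeper event that can never arrive.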

Here are the logs:
1) HOST-10-18-40-153 is registered to master
{code}
2013-09-19 19:47:41,123 DEBUG org.apache.hadoop.hbase.master.ServerManager: STARTUP: Server HOST-10-18-40-153,61020,1379600260255 came back up, removed it from the dead servers list
2013-09-19 19:47:41,123 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=HOST-10-18-40-153,61020,1379600260255
{code}
{code}
2013-09-19 19:47:41,119 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at HOST-10-18-40-153/10.18.40.153:61000
2013-09-19 19:47:41,119 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at HOST-10-18-40-153,61000,1379600055284 that we are up with port=61020, startcode=1379600260255
{code}
2) Terminated before creating ephemeral node.
{code}
Thu Sep 19 19:47:41 IST 2013 Terminating regionserver
{code}
3) The RS can still be selected for assignment, and those assignments fail.
{code}
2013-09-19 19:47:54,049 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of -ROOT-,,0.70236052 to HOST-10-18-40-153,61020,1379600260255, trying to assign elsewhere instead; retry=0
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:390)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:436)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1127)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
        at $Proxy15.openRegion(Unknown Source)
        at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:533)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1734)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1431)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1406)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1401)
        at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:2374)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignRoot(MetaServerShutdownHandler.java:136)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignRootWithRetries(MetaServerShutdownHandler.java:160)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:82)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2013-09-19 19:47:54,050 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for -ROOT-,,0.70236052 destination server is HOST-10-18-40-153,61020,1379600260255
2013-09-19 19:47:54,050 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for -ROOT-,,0.70236052 so generated a random one; hri=-ROOT-,,0.70236052, src=, dest=HOST-10-18-40-153,61020,1379600260255; 1 (online=1, available=1) available servers
2013-09-19 19:47:54,050 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:61000-0x14135a277ff017d Creating (or updating) unassigned node for 70236052 with OFFLINE state
2013-09-19 19:47:54,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=HOST-10-18-40-153,61000,1379600055284, region=70236052/-ROOT-
2013-09-19 19:47:54,071 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for -ROOT-,,0.70236052 destination server is HOST-10-18-40-153,61020,1379600260255
2013-09-19 19:47:54,071 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region -ROOT-,,0.70236052; plan=hri=-ROOT-,,0.70236052, src=, dest=HOST-10-18-40-153,61020,1379600260255
2013-09-19 19:47:54,071 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region -ROOT-,,0.70236052 to HOST-10-18-40-153,61020,1379600260255
2013-09-19 19:47:54,072 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of -ROOT-,,0.70236052 to HOST-10-18-40-153,61020,1379600260255, trying to assign elsewhere instead; retry=1
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: HOST-10-18-40-153/10.18.40.153:61020
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1127)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
        at $Proxy15.openRegion(Unknown Source)
        at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:533)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1734)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1431)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1406)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1401)
        at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:2374)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignRoot(MetaServerShutdownHandler.java:136)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignRootWithRetries(MetaServerShutdownHandler.java:160)
        at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:82)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2013-09-19 19:47:54,072 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for -ROOT-,,0.70236052 destination server is HOST-10-18-40-153,61020,1379600260255
{code}
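
In the logs above, the dead region server is the only entry in the online list ("online=1, available=1"), so every retry picks it again. As a rough illustration of that retry behaviour, the sketch below selects a destination while skipping servers that have already refused a connection. The class and method names are hypothetical and not part of HBase. It also takes the first eligible server for determinism, whereas the log shows the master generating a random plan.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch; not the AssignmentManager logic. */
final class DestinationPicker {
    /**
     * Returns the first online server that is not in the failed set,
     * or an empty Optional when every online server has already failed.
     */
    static Optional<String> pick(List<String> onlineServers, Set<String> failedServers) {
        for (String server : onlineServers) {
            if (!failedServers.contains(server)) {
                return Optional.of(server);
            }
        }
        return Optional.empty();
    }
}
```

Applied to the situation in the logs, where the online list holds only HOST-10-18-40-153,61020,1379600260255 and that server has already failed, `pick` would return an empty result rather than retrying the same dead server.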

  was:
In some of our tests we found that regionserer always showing online in master 
UI but its actually dead.
If region server went down in the middle following steps then the region server 
always showing in master online servers list.
1) register to master
2) create  ephemeral znode

Since no notification from zookeeper, master is not removing the expired server.
Assignments also failing if the RS is selected as destination server.
Some cases 
 

    
> Region server left in online regionservers list if the region server went 
> down after registering to master and before creating ephemeral node
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9593
>                 URL: https://issues.apache.org/jira/browse/HBASE-9593
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.94.11
>            Reporter: rajeshbabu
>            Assignee: rajeshbabu
