Enis Soztutar created HBASE-9721:
------------------------------------

             Summary: meta assignment did not timeout
                 Key: HBASE-9721
                 URL: https://issues.apache.org/jira/browse/HBASE-9721
             Project: HBase
          Issue Type: Bug
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar
             Fix For: 0.98.0, 0.96.1


On a test cluster, this following events happened with ITBLL and CM leading to 
meta being unavailable until master is restarted. 

An RS carrying meta died, and master assigned the region to one of the RSs. 
{code}
2013-10-03 23:30:06,611 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:06,611 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
master.RegionStates: Transitioned {1588230740 state=OFFLINE, ts=1380843006601, 
server=null} to {1588230740 state=PENDING_OPEN, ts=1380843006611, 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
2013-10-03 23:30:06,611 DEBUG 
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
master.ServerManager: New admin connection to 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
{code}

At the same time, the RS that meta recently got assigned also died (due to CM), 
and restarted: 
{code}
2013-10-03 23:30:07,636 DEBUG [RpcServer.handler=17,port=60000] 
master.ServerManager: REPORT: Server 
gs-hdp2-secure-1380781860-hbase-8.cs1cloud.internal,60020,1380843002494 came 
back up, removed it from the dead servers list
2013-10-03 23:30:08,769 INFO  [RpcServer.handler=18,port=60000] 
master.ServerManager: Triggering server recovery; existingServer 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 looks 
stale, new 
server:gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] 
master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 
current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820,
 matches=true
2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] 
master.ServerManager: 
Added=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 
to dead servers, submitted shutdown handler to be executed meta=true
2013-10-03 23:30:08,771 INFO  [RpcServer.handler=18,port=60000] 
master.ServerManager: Registering 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
2013-10-03 23:30:08,772 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
handler.MetaServerShutdownHandler: Splitting hbase:meta logs for 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
{code}

AM/SSH sees that the RS that died was carrying meta, but the assignment RPC 
request was still not sent:
{code}
2013-10-03 23:30:08,791 DEBUG 
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 
current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820,
 matches=true
2013-10-03 23:30:08,791 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
handler.MetaServerShutdownHandler: Server 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 was 
carrying META. Trying to assign.
2013-10-03 23:30:08,791 DEBUG 
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
master.RegionStates: Offline 1588230740 with current state=PENDING_OPEN, 
expected state=OFFLINE/SPLITTING/MERGING
2013-10-03 23:30:08,791 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
master.RegionStates: Transitioned {1588230740 state=PENDING_OPEN, 
ts=1380843006611, 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820} 
to {1588230740 state=OFFLINE, ts=1380843008791, server=null}
2013-10-03 23:30:09,809 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in 
ZooKeeper
{code}

Our first attempt at the assign rpc fails, because the new server is now 
starting. The second attempt though succeeds: 
{code}
2013-10-03 23:30:10,621 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,621 INFO  
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
master.RegionStates: Transitioned {1588230740 state=OFFLINE, ts=1380843008791, 
server=null} to {1588230740 state=PENDING_OPEN, ts=1380843010621, 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
2013-10-03 23:30:10,621 DEBUG 
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
master.ServerManager: New admin connection to 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,622 DEBUG [RpcServer.handler=22,port=60000] 
master.ServerManager: REPORT: Server 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 came 
back up, removed it from the dead servers list
2013-10-03 23:30:10,934 DEBUG 
[MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
master.AssignmentManager: Skip assigning {ENCODED => 1588230740, NAME => 
'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}, it is already {1588230740 
state=PENDING_OPEN, ts=1380843010621, 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
{code}

Note that the start time for the server does not match (1380842900820 is old , 
1380843006362 is new). 

The region server got the rpc to open the region, but failed to change the zk 
state, because ServerNames is not matching: 
{code}
2013-10-03 23:30:10,601 INFO  [Priority.RpcServer.handler=0,port=60020] 
regionserver.HRegionServer: Open hbase:meta,,1.1588230740
2013-10-03 23:30:10,897 DEBUG 
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from 
M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2013-10-03 23:30:10,918 WARN  
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned node 
for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the 
server that tried to transition was 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not the 
expected gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,918 WARN  
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for 
region=1588230740
2013-10-03 23:30:10,919 WARN  
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for 
encodedName=1588230740
2013-10-03 23:30:10,919 INFO  
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
handler.OpenRegionHandler: Opening of region {ENCODED => 1588230740, NAME => 
'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} failed, transitioning from 
OFFLINE to FAILED_OPEN in ZK, expecting version 0
2013-10-03 23:30:10,919 DEBUG 
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from 
M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN
2013-10-03 23:30:10,921 WARN  
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned node 
for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN failed, the 
server that tried to transition was 
gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not the 
expected gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,921 WARN  
[RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
handler.OpenRegionHandler: Unable to mark region {ENCODED => 1588230740, NAME 
=> 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's likely 
that the master already timed out this open attempt, and thus another RS 
already has the region.
{code}

It seems that the RS behaved correct by not being able to open the region by 
transitioning the zk assignment node. However, the master fails to timeout the 
assignment even though the meta region is reported in RIT: 
{code}
2013-10-04 00:14:50,658 DEBUG 
[gs-hdp2-secure-1380781860-hbase-12.cs1cloud.internal,60000,1380842679864-BalancerChore]
 master.HMaster: Not running balancer because 1 region(s) in transition: 
{1588230740={1588230740 state=PENDING_OPEN, ts=1380843010621, 
server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}}
{code}




--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to