[
https://issues.apache.org/jira/browse/HBASE-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell reopened HBASE-9721:
-----------------------------------
This has caused persistent intermittent failures of
TestAssignmentManagerOnCluster on 0.98 branch. See HBASE-10341. We can address
the test failure on HBASE-10341 or I can revert this change on 0.98 branch.
> RegionServer should not accept regionOpen RPC intended for another(previous)
> server
> -----------------------------------------------------------------------------------
>
> Key: HBASE-9721
> URL: https://issues.apache.org/jira/browse/HBASE-9721
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.0, 0.96.0
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 0.98.0, 0.99.0
>
> Attachments: hbase-9721_v0.patch, hbase-9721_v1.patch,
> hbase-9721_v2.patch, hbase-9721_v3.patch
>
>
> On a test cluster, this following events happened with ITBLL and CM leading
> to meta being unavailable until master is restarted.
> An RS carrying meta died, and master assigned the region to one of the RSs.
> {code}
> 2013-10-03 23:30:06,611 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
> master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:06,611 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
> master.RegionStates: Transitioned {1588230740 state=OFFLINE,
> ts=1380843006601, server=null} to {1588230740 state=PENDING_OPEN,
> ts=1380843006611,
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> 2013-10-03 23:30:06,611 DEBUG
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
> master.ServerManager: New admin connection to
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> {code}
> At the same time, the RS that meta recently got assigned also died (due to
> CM), and restarted:
> {code}
> 2013-10-03 23:30:07,636 DEBUG [RpcServer.handler=17,port=60000]
> master.ServerManager: REPORT: Server
> gs-hdp2-secure-1380781860-hbase-8.cs1cloud.internal,60020,1380843002494 came
> back up, removed it from the dead servers list
> 2013-10-03 23:30:08,769 INFO [RpcServer.handler=18,port=60000]
> master.ServerManager: Triggering server recovery; existingServer
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 looks
> stale, new
> server:gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
> 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000]
> master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
>
> current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820,
> matches=true
> 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000]
> master.ServerManager:
> Added=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> to dead servers, submitted shutdown handler to be executed meta=true
> 2013-10-03 23:30:08,771 INFO [RpcServer.handler=18,port=60000]
> master.ServerManager: Registering
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
> 2013-10-03 23:30:08,772 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
> handler.MetaServerShutdownHandler: Splitting hbase:meta logs for
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> {code}
> AM/SSH sees that the RS that died was carrying meta, but the assignment RPC
> request was still not sent:
> {code}
> 2013-10-03 23:30:08,791 DEBUG
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
> master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
>
> current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820,
> matches=true
> 2013-10-03 23:30:08,791 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
> handler.MetaServerShutdownHandler: Server
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 was
> carrying META. Trying to assign.
> 2013-10-03 23:30:08,791 DEBUG
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
> master.RegionStates: Offline 1588230740 with current state=PENDING_OPEN,
> expected state=OFFLINE/SPLITTING/MERGING
> 2013-10-03 23:30:08,791 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
> master.RegionStates: Transitioned {1588230740 state=PENDING_OPEN,
> ts=1380843006611,
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> to {1588230740 state=OFFLINE, ts=1380843008791, server=null}
> 2013-10-03 23:30:09,809 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
> zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in
> ZooKeeper
> {code}
> Our first attempt at the assign rpc fails, because the new server is now
> starting. The second attempt though succeeds:
> {code}
> 2013-10-03 23:30:10,621 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
> master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,621 INFO
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
> master.RegionStates: Transitioned {1588230740 state=OFFLINE,
> ts=1380843008791, server=null} to {1588230740 state=PENDING_OPEN,
> ts=1380843010621,
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> 2013-10-03 23:30:10,621 DEBUG
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1]
> master.ServerManager: New admin connection to
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,622 DEBUG [RpcServer.handler=22,port=60000]
> master.ServerManager: REPORT: Server
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 came
> back up, removed it from the dead servers list
> 2013-10-03 23:30:10,934 DEBUG
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2]
> master.AssignmentManager: Skip assigning {ENCODED => 1588230740, NAME =>
> 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}, it is already {1588230740
> state=PENDING_OPEN, ts=1380843010621,
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> {code}
> Note that the start time for the server does not match (1380842900820 is old
> , 1380843006362 is new).
> The region server got the rpc to open the region, but failed to change the zk
> state, because ServerNames is not matching:
> {code}
> 2013-10-03 23:30:10,601 INFO [Priority.RpcServer.handler=0,port=60020]
> regionserver.HRegionServer: Open hbase:meta,,1.1588230740
> 2013-10-03 23:30:10,897 DEBUG
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign:
> regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from
> M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2013-10-03 23:30:10,918 WARN
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign:
> regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned
> node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed,
> the server that tried to transition was
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not
> the expected
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,918 WARN
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
> handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for
> region=1588230740
> 2013-10-03 23:30:10,919 WARN
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
> encodedName=1588230740
> 2013-10-03 23:30:10,919 INFO
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
> handler.OpenRegionHandler: Opening of region {ENCODED => 1588230740, NAME =>
> 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} failed, transitioning from
> OFFLINE to FAILED_OPEN in ZK, expecting version 0
> 2013-10-03 23:30:10,919 DEBUG
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign:
> regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from
> M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN
> 2013-10-03 23:30:10,921 WARN
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign:
> regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned
> node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN
> failed, the server that tried to transition was
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not
> the expected
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,921 WARN
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0]
> handler.OpenRegionHandler: Unable to mark region {ENCODED => 1588230740, NAME
> => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's likely
> that the master already timed out this open attempt, and thus another RS
> already has the region.
> {code}
> It seems that the RS behaved correct by not being able to open the region by
> transitioning the zk assignment node. However, the master fails to timeout
> the assignment even though the meta region is reported in RIT:
> {code}
> 2013-10-04 00:14:50,658 DEBUG
> [gs-hdp2-secure-1380781860-hbase-12.cs1cloud.internal,60000,1380842679864-BalancerChore]
> master.HMaster: Not running balancer because 1 region(s) in transition:
> {1588230740={1588230740 state=PENDING_OPEN, ts=1380843010621,
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}}
> {code}
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)