Jeffrey Zhong created HBASE-9665:
------------------------------------
Summary: Region gets lost when balancer & SSH both try to assign
Key: HBASE-9665
URL: https://issues.apache.org/jira/browse/HBASE-9665
Project: HBase
Issue Type: Bug
Components: Region Assignment
Affects Versions: 0.96.0
Reporter: Jeffrey Zhong
Priority: Critical
In summary, a server dies and its regions are re-assigned. Right before SSH
(ServerShutdownHandler) runs, however, the balancer starts assigning one of the
regions on that server to another server.
The balancer assignment gets preempted by the SSH assignment:
{code}
2013-09-25 11:55:32,854 INFO Priority.RpcServer.handler=7,port=60020
regionserver.HRegionServer: Received CLOSE for the
region:6deb1bfefe8cbdb443084efe919fdeb7 , which we are already trying to OPEN.
Cancelling OPENING.
{code}
The SSH assignment (by GeneralBulkAssigner) fails too, due to:
{code}
2013-09-25 11:55:32,927 WARN [RS_OPEN_REGION-hor15n09:60020-2]
zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to transition
the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from
M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to
transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the expected
hor15n07.gq1.ygridcore.net,60020,1380109890414
{code}
In the end, the region 6deb1bfefe8cbdb443084efe919fdeb7 is lost.
Below is the master log; you can see both the balancer and SSH try to assign
the region around the same time:
{code}
2013-09-25 11:55:32,731 INFO [MASTER_SERVER_OPERATIONS-hor15n05:60000-4]
master.RegionStates: Transitioning {6deb1bfefe8cbdb443084efe919fdeb7
state=PENDING_CLOSE, ts=1380110132710,
server=hor15n12.gq1.ygridcore.net,60020,1380109596307} will be handled by SSH
for hor15n12.gq1.ygridcore.net,60020,1380109596307
...
2013-09-25 11:55:32,849 INFO
[hor15n05.gq1.ygridcore.net,60000,1380108611483-BalancerChore]
master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7
state=OFFLINE, ts=1380110132768, server=null} to
{6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132849,
server=hor15n07.gq1.ygridcore.net,60020,1380109890414}
...
2013-09-25 11:55:32,898 INFO
[hor15n05.gq1.ygridcore.net,60000,1380108611483-GeneralBulkAssigner-1]
master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7
state=OFFLINE, ts=1380110132861, server=null} to
{6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132898,
server=hor15n09.gq1.ygridcore.net,60020,1380109280320}
{code}
Since SSH forces the region assignment without recreating the offline znode,
the later region opening fails with the following error. I suggest recreating
the offline znode whenever we force a region assignment (forceNewPlan=true);
this should be a low-impact change.
{code}
2013-09-25 11:55:32,927 WARN [RS_OPEN_REGION-hor15n09:60020-2]
zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to transition
the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from
M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to
transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the expected
hor15n07.gq1.ygridcore.net,60020,1380109890414
{code}
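To illustrate the failure and the suggested fix, here is a minimal toy model of the check that the warning above reports. It is not HBase code: the class ZnodeRaceSketch, its methods, and the short host names are all hypothetical, and the cancelled open on hor15n07 is not modeled. It assumes only what the logs show, namely that the offline znode names an expected server and that the OFFLINE to OPENING transition is rejected for any other server.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the unassigned-znode check; hypothetical, not HBase code. */
class ZnodeRaceSketch {
    /** Hypothetical stand-in for the data stored in an unassigned znode. */
    static final class Znode {
        final String state;
        final String server;
        Znode(String state, String server) {
            this.state = state;
            this.server = server;
        }
    }

    private final Map<String, Znode> unassigned = new HashMap<>();

    /** Master side: plan an assignment of the region to dest. */
    void assign(String region, String dest, boolean recreateOfflineNode) {
        if (recreateOfflineNode || !unassigned.containsKey(region)) {
            unassigned.put(region, new Znode("OFFLINE", dest));
        }
        // Otherwise the existing node, still naming the earlier destination, stays.
    }

    /** Region server side: OFFLINE -> OPENING succeeds only for the expected server. */
    boolean transitionToOpening(String region, String server) {
        Znode node = unassigned.get(region);
        if (node == null || !node.state.equals("OFFLINE") || !node.server.equals(server)) {
            return false;
        }
        unassigned.put(region, new Znode("OPENING", server));
        return true;
    }

    /** Replays the logged sequence: balancer plans hor15n07, then SSH forces hor15n09. */
    static boolean replay(boolean recreateOnForce) {
        ZnodeRaceSketch zk = new ZnodeRaceSketch();
        String region = "6deb1bfefe8cbdb443084efe919fdeb7";
        zk.assign(region, "hor15n07", true);            // balancer creates the offline node
        zk.assign(region, "hor15n09", recreateOnForce); // SSH forces a new plan
        return zk.transitionToOpening(region, "hor15n09");
    }

    public static void main(String[] args) {
        System.out.println("without recreate: " + replay(false)); // prints false
        System.out.println("with recreate:    " + replay(true));  // prints true
    }
}
```

In this sketch, replay(false) corresponds to the behavior reported here (the stale node still expects hor15n07, so hor15n09 is rejected), while replay(true) corresponds to recreating the offline znode when the assignment is forced.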