[
https://issues.apache.org/jira/browse/HBASE-10085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeffrey Zhong updated HBASE-10085:
----------------------------------
Attachment: hbase-10085.patch
I created a unit test case for this issue and verified the fix.
Just resetting the state to PENDING_OPEN in processRegionsInTransition may not
work. The assignment was indeed triggered; the issue is that the state stored
in ZK caused the assignment to be skipped, because the assignment logic assumed
the SSH would cover it later (this is expected).
Later, the SSHs for the two old servers both started as expected, but both of
them skipped the region assignment:
The source server SSH thought the region was already in RIT state, so it
skipped the region (this is expected). The destination server SSH thought the
RIT was in an unexpected state, so it skipped the region as well. I think the
latter is problematic, as there are several scenarios in which RITs can be in
offline state during region assignment.
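
To make the destination-side decision concrete, here is a minimal sketch of
the two behaviours. It is not the HBase code and not the attached patch: the
class OfflineRitSketch, the reduced State enum, the STRICT and LENIENT sets and
shouldReassign are all hypothetical names, and the exact set of states a real
SSH handles is an assumption made only for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of the SSH decision; none of these names
// are HBase APIs.
final class OfflineRitSketch {

    // Reduced set of region states, for illustration only.
    enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, PENDING_CLOSE, CLOSING, CLOSED }

    // Assumed "strict" policy: OFFLINE is not in the set, so a RIT that is
    // OFFLINE on the dead server is treated as unexpected and skipped.
    static final Set<State> STRICT =
        EnumSet.of(State.PENDING_OPEN, State.OPENING, State.OPEN);

    // Assumed "lenient" policy: OFFLINE on a dead server is re-assignable too.
    static final Set<State> LENIENT =
        EnumSet.of(State.OFFLINE, State.PENDING_OPEN, State.OPENING, State.OPEN);

    // Returns true when the handler should re-assign the region: it must have
    // been on the dead server and be in a state the given policy handles.
    static boolean shouldReassign(State state, boolean onDeadServer, Set<State> handled) {
        return onDeadServer && handled.contains(state);
    }

    public static void main(String[] args) {
        System.out.println("strict:  " + shouldReassign(State.OFFLINE, true, STRICT));
        System.out.println("lenient: " + shouldReassign(State.OFFLINE, true, LENIENT));
    }
}
```

Under the strict policy the OFFLINE region is never picked up by either SSH,
which matches the stuck region described above. The lenient policy only
illustrates the direction I am arguing for.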
> Some regions aren't re-assigned after a master restarts
> -------------------------------------------------------
>
> Key: HBASE-10085
> URL: https://issues.apache.org/jira/browse/HBASE-10085
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 0.96.1
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.98.0, 0.96.1
>
> Attachments: hbase-10085.patch
>
>
> We saw this issue during a cluster restart:
> 1) When shutting down a cluster, some regions are left in offline state
> because no region servers are available (the RSs are stopped first, then the
> Master).
> 2) When the cluster restarts, the offlined regions are forced offline
> again, and SSH skips re-assigning them in AM.processServerShutdown, as
> shown below.
> {code}
> 2013-12-03 10:41:56,686 INFO
> [master:h2-ubuntu12-sec-1386048659-hbase-8:60000] master.AssignmentManager:
> Processing 873dbd8c269f44d0aefb0f66c5b53537 in state: M_ZK_REGION_OFFLINE
> 2013-12-03 10:41:56,686 DEBUG
> [master:h2-ubuntu12-sec-1386048659-hbase-8:60000] master.AssignmentManager:
> RIT 873dbd8c269f44d0aefb0f66c5b53537 in state=M_ZK_REGION_OFFLINE was on
> deadserver; forcing offline
> ...
> 2013-12-03 10:41:56,739 DEBUG [AM.-pool1-t8] master.AssignmentManager: Force
> region state offline {873dbd8c269f44d0aefb0f66c5b53537 state=OFFLINE,
> ts=1386067316737,
> server=h2-ubuntu12-sec-1386048659-hbase-6.cs1cloud.internal,60020,1386066968696}
> ...
> 2013-12-03 10:41:57,223 WARN
> [MASTER_SERVER_OPERATIONS-h2-ubuntu12-sec-1386048659-hbase-8:60000-3]
> master.RegionStates: THIS SHOULD NOT HAPPEN: unexpected
> {873dbd8c269f44d0aefb0f66c5b53537 state=OFFLINE, ts=1386067316737,
> server=h2-ubuntu12-sec-1386048659-hbase-6.cs1cloud.internal,60020,1386066968696}
>
> {code}