[
https://issues.apache.org/jira/browse/HBASE-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578501#comment-13578501
]
Jimmy Xiang commented on HBASE-7799:
------------------------------------
@Ram, thanks a lot for verifying the patch.
bq. But the hris obtained by META scan does not have it in SSH. So i dont find
this region getting populated in toAssignRegions
If Am.processServerShutdown() picks it up, then it should be in toAssignRegions,
since we add them all even if it is not in hris:
{code} SSH
List<HRegionInfo> toAssignRegions = new ArrayList<HRegionInfo>();
toAssignRegions.addAll(regionsInTransition);
{code}
Am.processServerShutdown could skip a region if it is not pending open any more:
{code} AM
if (regionState == null
    || !regionState.isPendingOpenOrOpeningOnServer(sn)) {
  LOG.info("Skip region " + hri
    + " since it is not opening on the dead server any more: " + sn);
  it.remove();
{code}
Could you please check if the region is still pending open?
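To make the interaction between the two snippets above easier to follow, here is a minimal, self-contained sketch. It is not the HBase code: RegionStateSketch, ServerShutdownSketch, the string-valued states, the use of region names instead of HRegionInfo, and the de-duplication of hris are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the master's region state (not the HBase RegionState class).
final class RegionStateSketch {
  final String state;   // assumed values: "PENDING_OPEN", "OPENING", "OPEN", ...
  final String server;  // server the region is opening on, or null

  RegionStateSketch(String state, String server) {
    this.state = state;
    this.server = server;
  }

  // Assumed semantics: the region is pending open or opening, and on the given server.
  boolean isPendingOpenOrOpeningOnServer(String sn) {
    boolean opening = "PENDING_OPEN".equals(state) || "OPENING".equals(state);
    return opening && sn.equals(server);
  }
}

final class ServerShutdownSketch {
  // AM side (sketch): keep only regions still pending open or opening on the dead server sn.
  static List<String> processServerShutdown(
      List<String> candidates, Map<String, RegionStateSketch> states, String sn) {
    List<String> regions = new ArrayList<>(candidates);
    for (Iterator<String> it = regions.iterator(); it.hasNext();) {
      String hri = it.next();
      RegionStateSketch regionState = states.get(hri);
      if (regionState == null
          || !regionState.isPendingOpenOrOpeningOnServer(sn)) {
        it.remove();  // skipped: not opening on the dead server any more
      }
    }
    return regions;
  }

  // SSH side (sketch): everything the AM returned is added, whether or not it is in hris.
  static List<String> toAssignRegions(List<String> regionsInTransition, List<String> hris) {
    List<String> toAssignRegions = new ArrayList<>();
    toAssignRegions.addAll(regionsInTransition);
    for (String hri : hris) {
      // De-duplication is an assumption of this sketch.
      if (!toAssignRegions.contains(hri)) {
        toAssignRegions.add(hri);
      }
    }
    return toAssignRegions;
  }
}
```

In this sketch, a region that is already OPEN, or that is opening on a different server, is removed on the AM side and so never reaches toAssignRegions unless the META scan also returns it. That would be consistent with the region missing from toAssignRegions if it is no longer pending open.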
As to the callback issue, yes, it can be solved in another jira.
> reassigning region stuck in open still may not work correctly due to leftover
> ZK node
> -------------------------------------------------------------------------------------
>
> Key: HBASE-7799
> URL: https://issues.apache.org/jira/browse/HBASE-7799
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Attachments:
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt.gz,
> trunk-7799.patch
>
>
> (logs grepped by region name, and abridged.)
> META server was dead so OpenRegionHandler for the region took a while, and
> was interrupted:
> {code}
> 2013-02-08 14:35:01,555 DEBUG
> [RS_OPEN_REGION-10.11.2.92,64485,1360362800564-2]
> handler.OpenRegionHandler(255): Interrupting thread
> Thread[PostOpenDeployTasks:871d1c3bdf98a2c93b527cb6cc61327d,5,main]
> {code}
> Then master tried to force region offline and reassign:
> {code}
> 2013-02-08 14:35:06,500 INFO
> [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1]
> master.RegionStates(347): Found opening region
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to
> be reassigned by SSH for 10.11.2.92,64485,1360362800564
> 2013-02-08 14:35:06,500 INFO
> [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1]
> master.RegionStates(242): Region {NAME =>
> 'IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.',
> STARTKEY => '7333332c', ENDKEY => '7ffffff8', ENCODED =>
> 871d1c3bdf98a2c93b527cb6cc61327d,} transitioned from
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,505 DEBUG
> [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1]
> master.AssignmentManager(1530): Forcing OFFLINE;
> was={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,506 DEBUG
> [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1]
> zookeeper.ZKAssign(176): master:64483-0x13cbbf1025d0000 Async create of
> unassigned node for 871d1c3bdf98a2c93b527cb6cc61327d with OFFLINE state
> {code}
> But didn't delete the original ZK node?
> {code}
> 2013-02-08 14:35:06,509 WARN [main-EventThread] master.OfflineCallback(59):
> Node for /hbase/region-in-transition/871d1c3bdf98a2c93b527cb6cc61327d already
> exists
> 2013-02-08 14:35:06,509 DEBUG [main-EventThread] master.OfflineCallback(69):
> rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OFFLINE, ts=1360362906506, server=null},
> server=10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,512 DEBUG [main-EventThread]
> master.OfflineCallback$ExistCallback(106):
> rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OFFLINE, ts=1360362906506, server=null},
> server=10.11.2.92,64488,1360362800651
> {code}
> So it went into an infinite cycle of failing to assign due to this:
> {code}
> 2013-02-08 14:35:06,517 INFO [PRI IPC Server handler 7 on 64488]
> regionserver.HRegionServer(3435): Received request to open region:
> IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> on 10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,521 WARN
> [RS_OPEN_REGION-10.11.2.92,64488,1360362800651-0] zookeeper.ZKAssign(762):
> regionserver:64488-0x13cbbf1025d0004 Attempt to transition the unassigned
> node for 871d1c3bdf98a2c93b527cb6cc61327d from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server [wrong server name redacted, see
> HBASE-7798]
> {code}
> Transitioning failed-to-open similarly fails.
> It seems like master needs to nuke ZK node unconditionally to offline?