[jira] [Commented] (HBASE-7799) reassigning region stuck in open still may not work correctly due to leftover ZK node

ramkrishna.s.vasudevan (JIRA) Wed, 13 Feb 2013 09:16:13 -0800

    [ https://issues.apache.org/jira/browse/HBASE-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577739#comment-13577739 ]


ramkrishna.s.vasudevan commented on HBASE-7799:
-----------------------------------------------

I debugged the code and reproduced the problem.
What I observed was:
{code}
try {
  ZKAssign.asyncCreateNodeOffline(watcher, state.getRegion(),
    destination, cb, state);
} catch (KeeperException e) {
  if (e instanceof NodeExistsException) {
    LOG.warn("Node for " + state.getRegion() + " already exists");
  } else {
    server.abort("Unexpected ZK exception creating/setting node OFFLINE", e);
  }
  return false;
}
return true;
{code}
This code effectively always returns true, because asyncCreateNodeOffline does 
not wait for the callback to take action.  The callback also does not throw 
NodeExistsException.
In short, the catch block is dead code.
So should we make this synchronous, or wait until the callback has processed 
the current ZK event?  The same thing exists in 0.94 too, but there bulk assign 
is not used in SSH, only for create table and enable table.
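
To illustrate the point, here is a minimal, self-contained sketch. It is not 
HBase or ZooKeeper code: AsyncCreateSketch, CreateCallback, asyncCreate, 
fireAndForget and createAndWait are all hypothetical names, and the "ZK" is 
just an in-memory set served by a single event thread. It only shows why a 
try/catch around an asynchronous create cannot observe "node exists", and what 
waiting for the callback could look like.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class AsyncCreateSketch {
  /** Hypothetical stand-in for the async callback; not HBase's OfflineCallback. */
  interface CreateCallback {
    void processResult(boolean nodeAlreadyExists);
  }

  // Assumption: an in-memory set plays the role of the ZK node tree.
  static final Set<String> nodes = ConcurrentHashMap.newKeySet();

  // Single daemon thread standing in for the ZK event thread.
  static final ExecutorService eventThread = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "event-thread");
    t.setDaemon(true);
    return t;
  });

  /** Returns immediately; the outcome is reported only to the callback. */
  static void asyncCreate(String path, CreateCallback cb) {
    eventThread.execute(() -> cb.processResult(!nodes.add(path)));
  }

  /** Same shape as the quoted code: the catch never sees "node exists". */
  static boolean fireAndForget(String path) {
    try {
      asyncCreate(path, exists -> { /* the failure is only visible here */ });
    } catch (RuntimeException e) {
      return false; // not reached for an existing node
    }
    return true;
  }

  /** One possible alternative: block until the callback has run. */
  static boolean createAndWait(String path) throws InterruptedException {
    CountDownLatch done = new CountDownLatch(1);
    AtomicBoolean created = new AtomicBoolean(false);
    asyncCreate(path, exists -> {
      created.set(!exists);
      done.countDown();
    });
    // The 5 second timeout is arbitrary; a timeout is treated as failure.
    return done.await(5, TimeUnit.SECONDS) && created.get();
  }

  public static void main(String[] args) throws Exception {
    String path = "/region-in-transition/abc";
    nodes.add(path); // simulate the leftover node
    System.out.println(fireAndForget(path)); // prints true
    System.out.println(createAndWait(path)); // prints false
  }
}
```

Blocking per region like this would of course cost some of the benefit of the 
async bulk path; waiting once for a whole batch of callbacks would be another 
option.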
                
> reassigning region stuck in open still may not work correctly due to leftover 
> ZK node
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-7799
>                 URL: https://issues.apache.org/jira/browse/HBASE-7799
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>         Attachments: 
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt.gz
>
>
> (Logs grepped by region name, and abridged.)
> META server was dead so OpenRegionHandler for the region took a while, and 
> was interrupted:
> {code}
> 2013-02-08 14:35:01,555 DEBUG 
> [RS_OPEN_REGION-10.11.2.92,64485,1360362800564-2] 
> handler.OpenRegionHandler(255): Interrupting thread 
> Thread[PostOpenDeployTasks:871d1c3bdf98a2c93b527cb6cc61327d,5,main]
> {code}
> Then master tried to force region offline and reassign:
> {code}
> 2013-02-08 14:35:06,500 INFO  
> [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1] 
> master.RegionStates(347): Found opening region 
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
>  state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to 
> be reassigned by SSH for 10.11.2.92,64485,1360362800564
> 2013-02-08 14:35:06,500 INFO  
> [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1] 
> master.RegionStates(242): Region {NAME => 
> 'IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.',
>  STARTKEY => '7333332c', ENDKEY => '7ffffff8', ENCODED => 
> 871d1c3bdf98a2c93b527cb6cc61327d,} transitioned from 
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
>  state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to 
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
>  state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,505 DEBUG 
> [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1] 
> master.AssignmentManager(1530): Forcing OFFLINE; 
> was={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
>  state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,506 DEBUG 
> [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1] 
> zookeeper.ZKAssign(176): master:64483-0x13cbbf1025d0000 Async create of 
> unassigned node for 871d1c3bdf98a2c93b527cb6cc61327d with OFFLINE state
> {code}
> But it didn't delete the original ZK node?
> {code}
> 2013-02-08 14:35:06,509 WARN  [main-EventThread] master.OfflineCallback(59): 
> Node for /hbase/region-in-transition/871d1c3bdf98a2c93b527cb6cc61327d already 
> exists
> 2013-02-08 14:35:06,509 DEBUG [main-EventThread] master.OfflineCallback(69): 
> rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
>  state=OFFLINE, ts=1360362906506, server=null}, 
> server=10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,512 DEBUG [main-EventThread] 
> master.OfflineCallback$ExistCallback(106): 
> rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
>  state=OFFLINE, ts=1360362906506, server=null}, 
> server=10.11.2.92,64488,1360362800651
> {code}
> So it went into an infinite cycle of failing to assign due to this:
> {code}
> 2013-02-08 14:35:06,517 INFO  [PRI IPC Server handler 7 on 64488] 
> regionserver.HRegionServer(3435): Received request to open region: 
> IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
>  on 10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,521 WARN  
> [RS_OPEN_REGION-10.11.2.92,64488,1360362800651-0] zookeeper.ZKAssign(762): 
> regionserver:64488-0x13cbbf1025d0004 Attempt to transition the unassigned 
> node for 871d1c3bdf98a2c93b527cb6cc61327d from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING failed, the node existed but was in the state 
> RS_ZK_REGION_OPENING set by the server [wrong server name redacted, see 
> HBASE-7798]
> {code}
> Transitioning to failed-to-open similarly fails.
> It seems like the master needs to nuke the ZK node unconditionally when 
> forcing a region offline?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
