[ 
https://issues.apache.org/jira/browse/HBASE-14207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankaj Kumar updated HBASE-14207:
---------------------------------
    Priority: Critical  (was: Major)

> Region was hijacked and remained in transition when RS failed to open a 
> region and later regionplan changed to new RS on retry
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14207
>                 URL: https://issues.apache.org/jira/browse/HBASE-14207
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>            Priority: Critical
>
> On production environment, following events happened
> 1. Master is trying to assign a region to RS, but due to 
> KeeperException$SessionExpiredException RS failed to open the region.
>       In RS log, saw multiple WARN log related to 
> KeeperException$SessionExpiredException 
>               > KeeperErrorCode = Session expired for 
> /hbase/region-in-transition/08f1935d652e5dbdac09b423b8f9401b
>               > Unable to get data of znode 
> /hbase/region-in-transition/08f1935d652e5dbdac09b423b8f9401b
> 2. Master retried to assign the region to same RS, but RS again failed.
> 3. On second retry new plan formed and this time plan destination (RS) is 
> different, so master send the request to new RS to open the region. But new 
> RS failed to open the region as there was server mismatch in ZNODE than the  
> expected current server name. 
> Logs Snippet:
> {noformat}
> HM
> 2015-07-14 03:50:29,759 | INFO  | master:T101PC03VM13:21300 | Processing 
> 08f1935d652e5dbdac09b423b8f9401b in state: M_ZK_REGION_OFFLINE | 
> org.apache.hadoop.hbase.master.AssignmentManager.processRegionsInTransition(AssignmentManager.java:644)
> 2015-07-14 03:50:29,759 | INFO  | master:T101PC03VM13:21300 | Transitioned 
> {08f1935d652e5dbdac09b423b8f9401b state=OFFLINE, ts=1436817029679, 
> server=null} to {08f1935d652e5dbdac09b423b8f9401b state=PENDING_OPEN, 
> ts=1436817029759, server=T101PC03VM13,21302,1436816690692} | 
> org.apache.hadoop.hbase.master.RegionStates.updateRegionState(RegionStates.java:327)
> 2015-07-14 03:50:29,760 | INFO  | master:T101PC03VM13:21300 | Processed 
> region 08f1935d652e5dbdac09b423b8f9401b in state M_ZK_REGION_OFFLINE, on 
> server: T101PC03VM13,21302,1436816690692 | 
> org.apache.hadoop.hbase.master.AssignmentManager.processRegionsInTransition(AssignmentManager.java:768)
> 2015-07-14 03:50:29,800 | INFO  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Assigning 
> INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to 
> T101PC03VM13,21302,1436816690692 | 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1983)
> 2015-07-14 03:50:29,801 | WARN  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Failed assignment of 
> INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to 
> T101PC03VM13,21302,1436816690692, trying to assign elsewhere instead; try=1 
> of 10 | 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2077)
> 2015-07-14 03:50:29,802 | INFO  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Trying to re-assign 
> INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to 
> the same failed server. | 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2123)
> 2015-07-14 03:50:31,804 | INFO  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Assigning 
> INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to 
> T101PC03VM13,21302,1436816690692 | 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1983)
> 2015-07-14 03:50:31,806 | WARN  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Failed assignment of 
> INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to 
> T101PC03VM13,21302,1436816690692, trying to assign elsewhere instead; try=2 
> of 10 | 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2077)
> 2015-07-14 03:50:31,807 | INFO  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Transitioned 
> {08f1935d652e5dbdac09b423b8f9401b state=PENDING_OPEN, ts=1436817031804, 
> server=T101PC03VM13,21302,1436816690692} to {08f1935d652e5dbdac09b423b8f9401b 
> state=OFFLINE, ts=1436817031807, server=T101PC03VM13,21302,1436816690692} | 
> org.apache.hadoop.hbase.master.RegionStates.updateRegionState(RegionStates.java:327)
> 2015-07-14 03:50:31,807 | INFO  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Assigning 
> INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. to 
> T101PC03VM14,21302,1436816997967 | 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1983)
> 2015-07-14 03:50:31,807 | INFO  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-3 | Transitioned 
> {08f1935d652e5dbdac09b423b8f9401b state=OFFLINE, ts=1436817031807, 
> server=T101PC03VM13,21302,1436816690692} to {08f1935d652e5dbdac09b423b8f9401b 
> state=PENDING_OPEN, ts=1436817031807, 
> server=T101PC03VM14,21302,1436816997967} | 
> org.apache.hadoop.hbase.master.RegionStates.updateRegionState(RegionStates.java:327)
> 2015-07-14 03:51:09,501 | INFO  | 
> MASTER_SERVER_OPERATIONS-T101PC03VM13:21300-4 | Skip assigning region in 
> transition on other server{08f1935d652e5dbdac09b423b8f9401b 
> state=PENDING_OPEN, ts=1436817031807, 
> server=T101PC03VM14,21302,1436816997967} | 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:250)
> {noformat}
> {noformat}
> RS - T101PC03VM14
> 2015-07-14 03:50:31,809 | INFO  | 
> PriorityRpcServer.handler=2,queue=0,port=21302 | Open 
> INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b. | 
> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:3671)
> 2015-07-14 03:50:31,830 | WARN  | RS_OPEN_REGION-T101PC03VM14:21302-2 | 
> regionserver:21302-0xe4e88f6f1b70002, 
> quorum=t101pc03vm12:24002,t101pc03vm13:24002,t101pc03vm14:24002, 
> baseZNode=/hbase Attempt to transition the unassigned node for 
> 08f1935d652e5dbdac09b423b8f9401b from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING failed, the server that tried to transition was 
> T101PC03VM14,21302,1436816997967 not the expected 
> T101PC03VM13,21302,1436816690692 | 
> org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:875)
> 2015-07-14 03:50:31,830 | WARN  | RS_OPEN_REGION-T101PC03VM14:21302-2 | 
> Failed transition from OFFLINE to OPENING for 
> region=08f1935d652e5dbdac09b423b8f9401b | 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.transitionZookeeperOfflineToOpening(OpenRegionHandler.java:539)
> 2015-07-14 03:50:31,831 | WARN  | RS_OPEN_REGION-T101PC03VM14:21302-2 | 
> Region was hijacked? Opening cancelled for 
> encodedName=08f1935d652e5dbdac09b423b8f9401b | 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:132)
> 2015-07-14 03:50:31,831 | INFO  | RS_OPEN_REGION-T101PC03VM14:21302-2 | 
> Opening of region {ENCODED => 08f1935d652e5dbdac09b423b8f9401b, NAME => 
> 'INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b.', 
> STARTKEY => '', ENDKEY => '200'} failed, transitioning from OFFLINE to 
> FAILED_OPEN in ZK, expecting version -1 | 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tryTransitionFromOfflineToFailedOpen(OpenRegionHandler.java:436)
> 2015-07-14 03:50:31,834 | WARN  | RS_OPEN_REGION-T101PC03VM14:21302-2 | 
> regionserver:21302-0xe4e88f6f1b70002, 
> quorum=t101pc03vm12:24002,t101pc03vm13:24002,t101pc03vm14:24002, 
> baseZNode=/hbase Attempt to transition the unassigned node for 
> 08f1935d652e5dbdac09b423b8f9401b from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_FAILED_OPEN failed, the server that tried to transition was 
> T101PC03VM14,21302,1436816997967 not the expected 
> T101PC03VM13,21302,1436816690692 | 
> org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:875)
> 2015-07-14 03:50:31,834 | WARN  | RS_OPEN_REGION-T101PC03VM14:21302-2 | 
> Unable to mark region {ENCODED => 08f1935d652e5dbdac09b423b8f9401b, NAME => 
> 'INTER_CONCURRENCY_SETTING,,1436596137981.08f1935d652e5dbdac09b423b8f9401b.', 
> STARTKEY => '', ENDKEY => '200'} as FAILED_OPEN. It's likely that the master 
> already timed out this open attempt, and thus another RS already has the 
> region. | 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tryTransitionFromOfflineToFailedOpen(OpenRegionHandler.java:444)
> {noformat}
> Since ZNODE was not modified with new destination server detail, so RS failed 
> to open it. 
> On Plan change we should set true to 'setOfflineInZK', so that ZNODE will be 
> modified with new destination server detail after resetting 
> 'versionOfOfflineNode' to -1.
> {code}
>           if (plan != newPlan && 
> !plan.getDestination().equals(newPlan.getDestination())) {
>             // Clean out plan we failed execute and one that doesn't look 
> like it'll
>             // succeed anyways; we need a new plan!
>             // Transition back to OFFLINE
>             currentState = regionStates.updateRegionState(region, 
> State.OFFLINE);
>             versionOfOfflineNode = -1;
>             plan = newPlan;
>           } 
> {code}
> Please correct me if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to