[ 
https://issues.apache.org/jira/browse/HBASE-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605582#comment-13605582
 ] 

rajeshbabu commented on HBASE-8127:
-----------------------------------

bq. One time when I saw the opening RIT stuck is due to the 
offlineDisabledRegion function in assignment manager. As you can see we don't 
handle opening RIT inside the function.

If I am not wrong, the HBASE-7824 patch was applied at that time, right?

One problem I suspect with the HBASE-7824 patch is this:
{code}
+    if (preMetaServer != null && failedServers.contains(preMetaServer)) {
+      // create recovered edits file for .META. server
+      this.fileSystemManager.splitLog(preMetaServer);
+      failedServers.remove(preMetaServer);
+    }
{code}

If an RS carrying ROOT or META went down, we do not call SSH for that RS (we do 
not even add it to deadservers). We handle the regions in transition to the dead 
server through processRegionsInTransitions, which can leave an RIT stuck in the 
OPENING state. If the znode is in the RS_ZK_REGION_OPENING state, we just add 
the region to RIT and wait for the TM to handle it. 
{code}
       regionsInTransition.put(encodedRegionName, new RegionState(regionInfo,
            RegionState.State.OPENING, data.getStamp(), data.getOrigin()));
        failoverProcessedRegions.put(encodedRegionName, regionInfo);
{code}
Whenever the TM handles it, we will try to assign the region. In that case the 
RIT can get stuck, because the assignment sees the table as DISABLING/DISABLED. 
If the RS is really alive, this case won't happen, because unassign will be 
called after the assignment.
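
To make the sequence above concrete, here is a minimal toy model of it. It is 
only a sketch under my reading of the flow: none of the class, field, or method 
names below are HBase APIs, and the "skip assignment for a DISABLING/DISABLED 
table" behavior is assumed from the description above rather than copied from 
the AssignmentManager code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: every name here is hypothetical, not an HBase class or method.
final class RitStuckSketch {
  enum State { OPENING }

  // encoded region name -> state, standing in for regionsInTransition
  private final Map<String, State> regionsInTransition = new HashMap<>();
  // encoded region name -> table name
  private final Map<String, String> regionToTable = new HashMap<>();
  // tables that are currently DISABLING or DISABLED
  private final Set<String> disablingOrDisabled = new HashSet<>();

  // Failover path: an OPENING znode is only recorded in RIT, nothing else runs.
  void recordOpening(String region, String table) {
    regionToTable.put(region, table);
    regionsInTransition.put(region, State.OPENING);
  }

  void markDisablingOrDisabled(String table) {
    disablingOrDisabled.add(table);
  }

  // Timeout path: try to assign the region again.
  void onTimeout(String region) {
    if (disablingOrDisabled.contains(regionToTable.get(region))) {
      // Assumed behavior: assignment is skipped for such a table, and since
      // the hosting server is dead nobody else clears the entry.
      return;
    }
    // Stands in for a successful assign/open that clears the RIT entry.
    regionsInTransition.remove(region);
  }

  boolean isInTransition(String region) {
    return regionsInTransition.containsKey(region);
  }

  public static void main(String[] args) {
    RitStuckSketch model = new RitStuckSketch();
    model.recordOpening("regionOfDisabledTable", "t1");
    model.recordOpening("regionOfEnabledTable", "t2");
    model.markDisablingOrDisabled("t1");
    model.onTimeout("regionOfDisabledTable");
    model.onTimeout("regionOfEnabledTable");
    System.out.println("disabled-table region still in RIT: "
        + model.isInTransition("regionOfDisabledTable"));
    System.out.println("enabled-table region still in RIT: "
        + model.isInTransition("regionOfEnabledTable"));
  }
}
```

In this model the region of the disabled table never leaves RIT, however many 
times the timeout fires, which is the stuck state described above.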

For the HBASE-7824 patch we can make the change below, which avoids RITs 
getting stuck in states such as OPENING.
If the meta RS is down before/during master restart, we can add it to 
deadservers and start SSH, passing shouldSplitHlog as false because the logs 
have already been split.
{code}
    this.deadservers.add(serverName);
    this.services.getExecutorService().submit(
      new ServerShutdownHandler(this.master, this.services, this.deadservers,
        serverName, false));
{code}

Anyway, the actual problem you gave in the description can be handled on the 
SSH side. I am working on it.

One more thing about your feedback patch:
{code}
+        // delete RITs if exists in any state of disabling or disabled tables during master starts
+        // up
+        if (!hri.isMetaTable()) {
+          String tableName = hri.getTableNameAsString();
+          boolean disabled = this.zkTable.isDisabledTable(tableName);
+          if (disabled || this.zkTable.isDisablingTable(tableName)) {
+            ZKAssign.deleteNodeFailSilent(watcher, hri);
+            regionOffline(hri);
+            continue;
+          }
+        }
{code}

We don't know whether a region of the DISABLING table has already been closed 
on the RS, so we should not offline the region directly. In SSH we can do that, 
because the RS has gone down.
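
The rule I have in mind can be sketched as a small decision helper. This is an 
illustration only: the class, enum, and method names are hypothetical and do 
not exist in HBase, and the set of dead servers stands in for whatever the 
master actually knows at that point.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not HBase code: decides how to treat a region of a
// DISABLING/DISABLED table that is found in transition.
final class DisablingRegionCleanupSketch {
  enum Action { OFFLINE_NOW, LEAVE_FOR_NORMAL_CLOSE }

  private final Set<String> deadServers = new HashSet<>();

  void markDead(String serverName) {
    deadServers.add(serverName);
  }

  Action decide(String hostingServer) {
    // Only when the hosting RS is known to be dead is it certain that the
    // region is no longer open anywhere, so only then offline it directly.
    if (deadServers.contains(hostingServer)) {
      return Action.OFFLINE_NOW;
    }
    // Otherwise the RS may still have the region open; let the usual
    // unassign/close flow finish instead of marking it offline.
    return Action.LEAVE_FOR_NORMAL_CLOSE;
  }
}
```

The check on the dead-server set is the part that SSH gets for free, since it 
runs only for a server that has already gone down.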

                
> Region of a disabling or disabled table could be stuck in transition state 
> when RS dies during Master initialization
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8127
>                 URL: https://issues.apache.org/jira/browse/HBASE-8127
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.94.7
>
>         Attachments: HBASE-8127_feedback.patch, HBASE-8127.patch, 
> hbase-8127_v1.patch, reproduce-hang.patch
>
>
> The issue happens when a RS dies during a master starts up. After the RS 
> reports open to the new master instance and dies immediately thereafter, the 
> RITs of disabling tables(or disabled table) on the died RS will be in RIT 
> state forever.
> I attached a patch to simulate the situation and you can run the following 
> command to reproduce the issue:
> {code}mvn test -PlocalTests 
> -Dtest=TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS{code}
> Basically, we skip regions of a dead server inside 
> AM.processDeadServersAndRecoverLostRegions as the following code and relies 
> on SSH to process those skipped regions:
> {code}
>           for (Pair<HRegionInfo, Result> deadRegion : deadServer.getValue()) {
>             nodes.remove(deadRegion.getFirst().getEncodedName());
>           }
> {code} 
> While in SSH, we skip regions of disabling(or disabled table) again by 
> function processDeadRegion. Finally comes to the issue that RITs of 
> disabling(or disabled table) stuck there forever.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
