Author: liyin
Date: Sat Dec 28 19:18:17 2013
New Revision: 1553891

URL: http://svn.apache.org/r1553891
Log:
[0.89-fb] [master] Ensure that rootRS death is processed correctly upon master 
failover.

Author: aaiyer

Summary:
We have seen cases where kill-hbase leaves the system in a state where
the master cannot assign any regions, because it never assigns the root
region.

This seems to happen because the old rootRegionServer was being processed
as dead, and as part of that processing the master does a root scan, which
fails because that region server is dead.

When it is instantiated, ProcessServerShutdown (PSS) checks whether the
deadServer is the root server and, if so, tries to get root reassigned.
However, after a master failover, where the state is reconstructed by
ZKClusterStateRecovery, the rootRegionServer reference can be updated after
the ProcessServerShutdown is created. In that case, PSS never makes any
progress past the split log stage.
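
The decision this change adds can be sketched roughly as follows. This is a
simplified illustration under stated assumptions, not the actual HBase code:
RootReassignSketch, ServerId, Action and decide() are hypothetical names made
up for the example. It only models the comparison of the dead server's
address and start code against the server currently recorded as holding root.

```java
// Hypothetical, simplified model of the check; these are NOT the real HBase classes.
final class RootReassignSketch {

    /** Stand-in for a region server identity: address plus start code. */
    static final class ServerId {
        final String address;   // assumed "host:port" form, for illustration only
        final long startCode;   // differs between two incarnations of the same address

        ServerId(String address, long startCode) {
            this.address = address;
            this.startCode = startCode;
        }
    }

    enum Action {
        REASSIGN_ROOT, // root still points at the dead incarnation: reassign it
        LEAVE_ROOT,    // same address but a different start code: leave root alone
        PROCEED        // root is elsewhere (or unknown): nothing special to do
    }

    /** Decides what to do with root while processing the death of {@code dead}. */
    static Action decide(ServerId dead, ServerId rootHolder) {
        if (rootHolder == null || !rootHolder.address.equals(dead.address)) {
            return Action.PROCEED;
        }
        return rootHolder.startCode == dead.startCode
                ? Action.REASSIGN_ROOT
                : Action.LEAVE_ROOT;
    }

    public static void main(String[] args) {
        ServerId dead = new ServerId("rs1:60020", 100L);
        System.out.println(decide(dead, new ServerId("rs1:60020", 100L))); // REASSIGN_ROOT
        System.out.println(decide(dead, new ServerId("rs1:60020", 200L))); // LEAVE_ROOT
        System.out.println(decide(dead, new ServerId("rs2:60020", 100L))); // PROCEED
    }
}
```

Comparing the start code in addition to the address is what keeps a restarted
server on the same host and port from being mistaken for the dead one.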

Test Plan: Repro the scenario on my dev cluster.

Reviewers: liyintang, rshroff

Reviewed By: liyintang

CC: mbm, hbase-eng@

Differential Revision: https://phabricator.fb.com/D1091877

Modified:
    hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java

Modified: hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
URL: http://svn.apache.org/viewvc/hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java?rev=1553891&r1=1553890&r2=1553891&view=diff
==============================================================================
--- hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java (original)
+++ hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java Sat Dec 28 19:18:17 2013
@@ -49,6 +49,7 @@ class ProcessServerShutdown extends Regi
   // Server name made of the concatenation of hostname, port and startcode
   // formatted as <code>&lt;hostname> ',' &lt;port> ',' &lt;startcode></code>
   private final String deadServer;
+  private long deadServerStartCode;
   private boolean isRootServer;
   private List<MetaRegion> metaRegions, metaRegionsUnassigned;
   private boolean rootRescanned;
@@ -83,6 +84,7 @@ class ProcessServerShutdown extends Regi
     super(master, serverInfo.getServerName());
     this.deadServer = serverInfo.getServerName();
     this.deadServerAddress = serverInfo.getServerAddress();
+    this.deadServerStartCode = serverInfo.getStartCode();
     this.rootRescanned = false;
     this.successfulMetaScans = new HashSet<String>();
     // check to see if I am responsible for either ROOT or any of the META tables.
@@ -315,6 +317,22 @@ class ProcessServerShutdown extends Regi
       if (LOG.isDebugEnabled()) {
         HServerAddress addr = master.getRegionManager().getRootRegionLocation();
         if (addr != null) {
+          if (addr.equals(deadServerAddress)) {
+            // We should not happen unless the master has restarted recently, because we
+            // explicitly call unsetRootRegion() in closeMetaRegions, which is called when
+            // ProcessServerShutdown was instantiated.
+            // However, in the case of a recovery by ZKClusterStateRecovery, it is possible that
+            // the rootRegion was updated after closeMetaRegions() was called. If we let the rootRegion
+            // point to a dead server,  the cluster might just block, because all ScanRootRegion calls
+            // will continue to fail. Let us fix this, by ensuring that the root gets reassigned.
+            if (deadServerStartCode == master.getRegionManager().getRootServerInfo().getStartCode()) {
+              LOG.error(ProcessServerShutdown.this.toString() + " unsetting root because it is on the dead server being processed" );
+              master.getRegionManager().reassignRootRegion();
+              return false;
+            } else {
+              LOG.info(ProcessServerShutdown.this.toString() + " NOT unsetting root because it is on the dead server, but different start code" );
+            }
+          }
           LOG.debug(ProcessServerShutdown.this.toString() + " scanning root region on " +
               addr.getBindAddress());
         } else {

