[ 
https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594118#comment-14594118
 ] 

Andrew Purtell commented on HBASE-13937:
----------------------------------------

bq. Do you mind re-review according to above? 

The check for isDeadServer is moved out of the retry loop:
{code}
@@ -903,13 +901,14 @@ public class ServerManager {
   public boolean isServerReachable(ServerName server) {
     if (server == null) throw new NullPointerException("Passed server is 
null");
 
+    synchronized (this.onlineServers) {
+      if (this.deadservers.isDeadServer(server)) {
+        return false;
+      }
+    }
+
     RetryCounter retryCounter = pingRetryCounterFactory.create();
     while (retryCounter.shouldRetry()) {
-      synchronized (this.onlineServers) {
-        if (this.deadservers.isDeadServer(server)) {
-          return false;
-        }
-      }
       try {
         AdminService.BlockingInterface admin = getRsAdmin(server);
         if (admin != null) {
{code}

Yes, I was wrong about the second part. The patch under review here does:
{code}
@@ -917,11 +916,6 @@ public class ServerManager {
           return info != null && info.hasServerName()
             && server.getStartcode() == info.getServerName().getStartCode();
         }
-      } catch (RegionServerStoppedException | ServerNotRunningYetException e) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Couldn't reach " + server, e);
-        }
-        break;
{code}
I must have gone back to the wrong tab and looked at the patch on the original 
issue.

lgtm, FWIW

> Partially revert HBASE-13172 
> -----------------------------
>
>                 Key: HBASE-13937
>                 URL: https://issues.apache.org/jira/browse/HBASE-13937
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Region Assignment
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0
>
>         Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch
>
>
> HBASE-13172 is supposed to fix a UT issue, but causes other problems that 
> parent jira (HBASE-13605) is attempting to fix. 
> However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, 
> to put it mildly, major design flaws in AM / RS. 
> Regardless of 13605, the issue with 13172 is that we catch 
> {{ServerNotRunningYetException}} from {{isServerReachable()}} and return 
> false, which then puts the Server to the {{RegionStates.deadServers}} list. 
> Once it is in that list, we can still assign and unassign regions to the RS 
> after it has started (because regular assignment does not check whether the 
> server is in  {{RegionStates.deadServers}}. However, after the first assign 
> and unassign, we cannot assign the region again since then the check for the 
> lastServer will think that the server is dead. 
> It turns out that a proper patch for 13605 is very hard without fixing rest 
> of  broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a 
> colorful history). For 1.1.1, I think we should just revert parts of 
> HBASE-13172 for now. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to