[ 
https://issues.apache.org/jira/browse/HBASE-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-13605:
----------------------------------
    Attachment: hbase-13605_v3-branch-1.1.patch

We have discovered an issue that surfaces once the v1 patch is applied. 
It is due to this logic in AM: 
{code}
      for (ServerName serverName: deadServers) {
        if (!serverManager.isServerDead(serverName)) {
          serverManager.expireServer(serverName); // Let SSH do region re-assign
        }
      }
{code}

Notice that we are expiring the server IF it is NOT dead. Seems weird, right? I 
assume this check was added so that SSH is not triggered twice. 

The v1 patch changes {{serverManager.isServerDead}} so that if a new server is 
registered in the online servers, the old server IS considered dead: 

{code}
  public synchronized boolean isServerDead(ServerName serverName) {
    if (serverName == null || deadservers.isDeadServer(serverName)
        || queuedDeadServers.contains(serverName)
        || requeuedDeadServers.containsKey(serverName)) {
      return true;
    }

    // we are not acquiring the lock
    ServerName onlineServer = findServerWithSameHostnamePortWithLock(serverName);
    if (onlineServer != null && serverName.getStartcode() < onlineServer.getStartcode()) {
      return true;
    }
{code}

In one of our tests, that is exactly what happened. The RS registered with a 
new identifier (thus onlineServers contained the new definition), so 
isServerDead() returned true for the old one, the expireServer() call above was 
skipped, and SSH for the old guy was never called. 
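
To make the failure mode concrete, here is a minimal, self-contained sketch of 
the interaction. It is not HBase code: the class V1ExpirySketch, the string keys 
standing in for ServerName, and the expireIfNotDead() helper are all 
assumptions made for illustration. Only the shape of the two checks is taken 
from the snippets above. 

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the v1 behavior; not the real ServerManager.
class V1ExpirySketch {
    // host:port -> start code of the currently registered (online) instance
    final Map<String, Long> online = new HashMap<>();
    // "host:port,startcode" entries already handed to dead server processing
    final Set<String> deadList = new HashSet<>();
    // instances for which expireServer() was actually invoked
    final List<String> expired = new ArrayList<>();

    static String key(String hostPort, long startcode) {
        return hostPort + "," + startcode;
    }

    // v1 semantics: in the dead list, OR superseded by a newer instance
    // registered on the same host:port.
    boolean isServerDead(String hostPort, long startcode) {
        if (deadList.contains(key(hostPort, startcode))) {
            return true;
        }
        Long onlineStartcode = online.get(hostPort);
        return onlineStartcode != null && startcode < onlineStartcode;
    }

    void expireServer(String hostPort, long startcode) {
        deadList.add(key(hostPort, startcode));
        expired.add(key(hostPort, startcode));
    }

    // Mirrors the AM loop body: expire only if not already considered dead.
    void expireIfNotDead(String hostPort, long startcode) {
        if (!isServerDead(hostPort, startcode)) {
            expireServer(hostPort, startcode);
        }
    }

    public static void main(String[] args) {
        V1ExpirySketch sm = new V1ExpirySketch();
        // The restarted RS has already registered with a newer start code.
        sm.online.put("rs1:16020", 200L);
        // The AM loop now looks at the old instance.
        sm.expireIfNotDead("rs1:16020", 100L);
        // Prints "true": the old instance was never expired.
        System.out.println(sm.expired.isEmpty());
    }
}
```

In this model the old instance is reported dead purely because a newer one is 
online, so the guard in the loop skips expireServer() for it. 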

The v3 patch fixes this condition by breaking isServerDead() into two parts. 
Old callers rely on isServerInDeadList(), while the assignment caller will 
rely on the new semantics.
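
A rough sketch of what such a split could look like is below. This is not the 
patch itself: the class SplitCheckSketch, the string keys standing in for 
ServerName, and the method bodies are assumptions for illustration. Only the 
names isServerInDeadList() and isServerDead() and the division of callers come 
from the description above. 

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the split check; not the real ServerManager.
class SplitCheckSketch {
    // host:port -> start code of the currently registered (online) instance
    final Map<String, Long> online = new HashMap<>();
    // "host:port,startcode" entries already handed to dead server processing
    final Set<String> deadList = new HashSet<>();
    // instances for which expireServer() was actually invoked
    final List<String> expired = new ArrayList<>();

    static String key(String hostPort, long startcode) {
        return hostPort + "," + startcode;
    }

    // Old semantics: true only if the instance is already tracked as dead.
    boolean isServerInDeadList(String hostPort, long startcode) {
        return deadList.contains(key(hostPort, startcode));
    }

    // New semantics: also true if a newer instance is online on the same host:port.
    boolean isServerDead(String hostPort, long startcode) {
        if (isServerInDeadList(hostPort, startcode)) {
            return true;
        }
        Long onlineStartcode = online.get(hostPort);
        return onlineStartcode != null && startcode < onlineStartcode;
    }

    void expireServer(String hostPort, long startcode) {
        deadList.add(key(hostPort, startcode));
        expired.add(key(hostPort, startcode));
    }

    // Expiry loop body as an "old caller": it only asks whether dead server
    // processing was already triggered, so a superseded instance still gets expired.
    void expireIfNotInDeadList(String hostPort, long startcode) {
        if (!isServerInDeadList(hostPort, startcode)) {
            expireServer(hostPort, startcode);
        }
    }

    public static void main(String[] args) {
        SplitCheckSketch sm = new SplitCheckSketch();
        sm.online.put("rs1:16020", 200L);            // restarted RS, newer start code
        sm.expireIfNotInDeadList("rs1:16020", 100L); // old instance is expired once
        sm.expireIfNotInDeadList("rs1:16020", 100L); // second call is a no-op
        System.out.println(sm.expired.size());       // prints 1
    }
}
```

With the narrow check guarding the expiry, the old instance is expired exactly 
once, while isServerDead() can still report a superseded instance as dead to 
the assignment code. 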

We have been running tests including ITBLL, ITMTTR, etc with the v1 patch. 
Seems stable enough. v3 patch should fix the remaining issues. 

 



> RegionStates should not keep its list of dead servers
> -----------------------------------------------------
>
>                 Key: HBASE-13605
>                 URL: https://issues.apache.org/jira/browse/HBASE-13605
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Critical
>             Fix For: 2.0.0, 1.0.2, 1.1.1
>
>         Attachments: hbase-13605_v1.patch, hbase-13605_v3-branch-1.1.patch
>
>
> As mentioned in 
> https://issues.apache.org/jira/browse/HBASE-9514?focusedCommentId=13769761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13769761
>  and HBASE-12844, we should have only one source of cluster membership. 
> The list of dead servers, combined with RegionStates doing its own liveness 
> check (ServerManager.isServerReachable()), has caused an assignment problem 
> again in a test cluster, where RegionStates "thinks" that the server is dead 
> and that SSH will handle the region assignment. However, the RS is not dead at 
> all: it is living happily, and never gets a zk expiry, a YouAreDeadException, 
> or anything else. This leaves the list of regions unassigned in OFFLINE state. 
> Master assigning the region:
> {code}
> 15-04-20 09:02:25,780 DEBUG [AM.ZK.Worker-pool3-t330] master.RegionStates: 
> Onlined 77dddcd50c22e56bfff133c0e1f9165b on 
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 {ENCODED => 
> 77dddcd50c
> {code}
> Master then disabled the table, and unassigned the region:
> {code}
> 2015-04-20 09:02:27,158 WARN  [ProcedureExecutorThread-1] 
> zookeeper.ZKTableStateManager: Moving table loadtest_d1 state from DISABLING 
> to DISABLING
>  Starting unassign of 
> loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b. (offlining), 
> current state: {77dddcd50c22e56bfff133c0e1f9165b state=OPEN, 
> ts=1429520545780,   
> server=os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268}
> bleProcedure$BulkDisabler-0] master.AssignmentManager: Sent CLOSE to 
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 for region 
> loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.
> 2015-04-20 09:02:27,414 INFO  [AM.ZK.Worker-pool3-t316] master.RegionStates: 
> Offlined 77dddcd50c22e56bfff133c0e1f9165b from 
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}
> On table re-enable, AM does not assign the region: 
> {code}
> 2015-04-20 09:02:30,415 INFO  [ProcedureExecutorThread-3] 
> balancer.BaseLoadBalancer: Reassigned 25 regions. 25 retained the pre-restart 
> assignment.
> 2015-04-20 09:02:30,415 INFO  [ProcedureExecutorThread-3] 
> procedure.EnableTableProcedure: Bulk assigning 25 region(s) across 5 
> server(s), retainAssignment=true
> l,16000,1429515659726-GeneralBulkAssigner-4] master.RegionStates: Couldn't 
> reach online server 
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> l,16000,1429515659726-GeneralBulkAssigner-4] master.AssignmentManager: 
> Updating the state to OFFLINE to allow to be reassigned by SSH
> nmentManager: Skip assigning 
> loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b., it is on a dead 
> but not processed yet server: 
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
