[jira] [Commented] (HBASE-14536) Balancer & SSH interfering with each other leading to unavailability

Stephen Yuan Jiang (JIRA) Thu, 08 Oct 2015 21:16:07 -0700

    [ https://issues.apache.org/jira/browse/HBASE-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949869#comment-14949869 ]


Stephen Yuan Jiang commented on HBASE-14536:
--------------------------------------------

This JIRA is similar to HBASE-13330, but it differs in that the unavailable 
region here is META: a META region on a dead server is handled by a dedicated 
handler, MetaServerShutdownHandler (MetaSSH).

After the balancer offlined the META region, and before it assigned the region 
to the new RS, it found that the original hosting server was dead. To avoid 
corruption, it skipped the assignment and left it to SSH.
{noformat}
  private RegionState forceRegionStateToOffline()
    ...
    case OFFLINE:
      // This region could have been open on this server
      // for a while. If the server is dead and not processed
      // yet, we can move on only if the meta shows the
      // region is not on this server actually, or on a server
      // not dead, or dead and processed already.
      // In case not using ZK, we don't need this check because
      // we have the latest info in memory, and the caller
      // will do another round checking any way.
      if (useZKForAssignment
          && regionStates.isServerDeadAndNotProcessed(sn)
          && wasRegionOnDeadServerByMeta(region, sn)) {
        if (!regionStates.isRegionInTransition(region)) {
          LOG.info("Updating the state to " + State.OFFLINE
              + " to allow to be reassigned by SSH");
          regionStates.updateRegionState(region, State.OFFLINE);
    ...
{noformat}

However, when MetaSSH called AM.isCarryingRegion() to check whether the dead 
server was the RS hosting the META region, it found that the server recorded in 
the in-memory cache was null and concluded that someone else was assigning the 
region. As a result, neither the balancer nor MetaSSH assigned META.

{noformat}
2015-09-29 06:18:29,447 DEBUG [MASTER_META_SERVER_OPERATIONS-10.0.0.148:16000-2] master.AssignmentManager: based on AM, current region=hbase:meta,,1.1588230740 is on server=null server being checked: 10.0.0.149,16020,1443507203340
2015-09-29 06:18:29,451 INFO  [MASTER_META_SERVER_OPERATIONS-10.0.0.148:16000-2] handler.MetaServerShutdownHandler: META has been assigned to otherwhere, skip assigning.
{noformat}
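
To make the interleaving concrete, here is a minimal toy model of it in plain 
Java. It is only a sketch, not HBase code: the class MetaRaceSketch and all of 
its methods are hypothetical names, and it assumes (as the logs suggest) that 
offlining the region clears the server recorded for it in the in-memory cache.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the race; hypothetical names, not HBase classes.
class MetaRaceSketch {
    static final String META = "hbase:meta,,1.1588230740";

    // Stand-in for the master's in-memory region -> server cache.
    private final Map<String, String> regionToServer = new HashMap<>();
    // Servers known to be dead whose shutdown handling has not finished.
    private final Set<String> deadNotProcessed = new HashSet<>();

    void online(String region, String server) {
        regionToServer.put(region, server);
    }

    void serverDied(String server) {
        deadNotProcessed.add(server);
    }

    // Balancer move: offline the region first, then assign it to dest
    // unless the source server is dead and not yet processed.
    boolean balancerMove(String region, String dest) {
        // Assumption: offlining clears the cached server for the region.
        String src = regionToServer.remove(region);
        if (src != null && deadNotProcessed.contains(src)) {
            return false; // skip the assign and leave it to SSH
        }
        regionToServer.put(region, dest);
        return true;
    }

    // Meta shutdown handling: reassign META only if the cache still says
    // the dead server was carrying it.
    boolean metaShutdownHandler(String deadServer, String fallback) {
        boolean carrying = deadServer.equals(regionToServer.get(META));
        if (carrying) {
            regionToServer.put(META, fallback);
        }
        deadNotProcessed.remove(deadServer); // server is now "processed"
        return carrying;
    }

    boolean isMetaAssigned() {
        return regionToServer.containsKey(META);
    }

    public static void main(String[] args) {
        String dead = "10.0.0.149,16020,1443507203340";
        String dest = "10.0.0.205,16020,1443507257905";

        MetaRaceSketch m = new MetaRaceSketch();
        m.online(META, dead);
        m.serverDied(dead);
        boolean byBalancer = m.balancerMove(META, dest);   // balancer runs first
        boolean bySsh = m.metaShutdownHandler(dead, dest); // MetaSSH runs second
        System.out.println("balancer assigned=" + byBalancer
            + ", MetaSSH assigned=" + bySsh
            + ", META assigned=" + m.isMetaAssigned());
    }
}
```

In this model each side defers to the other: the balancer skips the assign 
because the source server is dead and unprocessed, and the shutdown handler 
skips it because the cache no longer maps META to the dead server. Running the 
handler before the balancer move in the same model leaves META assigned.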

> Balancer & SSH interfering with each other leading to unavailability
> --------------------------------------------------------------------
>
>                 Key: HBASE-14536
>                 URL: https://issues.apache.org/jira/browse/HBASE-14536
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 1.1.2
>            Reporter: Devaraj Das
>            Assignee: Stephen Yuan Jiang
>             Fix For: 1.1.4
>
>         Attachments: master-log.tgz
>
>
> Came across this in our cluster:
> 1. The meta was assigned to a server 10.0.0.149,16020,1443507203340
> {noformat}
> 2015-09-29 06:16:22,472 DEBUG [AM.ZK.Worker-pool2-t56] 
> master.RegionStates: Onlined 1588230740 on 
> 10.0.0.149,16020,1443507203340 {ENCODED => 1588230740, NAME => 
> 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
> {noformat}
> 2. The server dies at some point:
> {noformat}
> 2015-09-29 06:18:25,952 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [10.0.0.149,16020,1443507203340]
> 2015-09-29 06:18:25,955 DEBUG [main-EventThread] master.AssignmentManager: 
> based on AM, current 
> region=hbase:meta,,1.1588230740 is on server=10.0.0.149,16020,1443507203340 
> server being checked: 
> 10.0.0.149,16020,1443507203340
> {noformat}
> 3. The balancer had computed a plan that contained a move for the meta:
> {noformat}
> 2015-09-29 06:18:26,833 INFO  
> [B.defaultRpcServer.handler=12,queue=0,port=16000] master.HMaster: 
> balance hri=hbase:meta,,1.1588230740, 
> src=10.0.0.149,16020,1443507203340, dest=10.0.0.205,16020,1443507257905
> {noformat}
> 4. The following ensues after this, leading to the meta remaining unassigned:
> {noformat}
> 2015-09-29 06:18:26,859 DEBUG 
> [B.defaultRpcServer.handler=12,queue=0,port=16000] 
> master.AssignmentManager: Offline hbase:meta,,1.1588230740, no need to 
> unassign since it's on a dead server: 10.0.0.149,16020,1443507203340
> ......................
> 2015-09-29 06:18:26,899 INFO  
> [B.defaultRpcServer.handler=12,queue=0,port=16000] master.RegionStates: 
> Offlined 1588230740 from 10.0.0.149,16020,1443507203340
> .....................
> 2015-09-29 06:18:26,914 INFO  
> [B.defaultRpcServer.handler=12,queue=0,port=16000] 
> master.AssignmentManager: Skip assigning hbase:meta,,1.1588230740, it is 
> on a dead but not processed yet server: 10.0.0.149,16020,1443507203340
> ....................
> 2015-09-29 06:18:26,915 DEBUG [AM.ZK.Worker-pool2-t58] 
> master.AssignmentManager: Znode hbase:meta,,1.1588230740 deleted, 
> state: {1588230740 state=OFFLINE, ts=1443507506914, 
> server=10.0.0.149,16020,1443507203340}
> ....................
> 2015-09-29 06:18:29,447 DEBUG 
> [MASTER_META_SERVER_OPERATIONS-10.0.0.148:16000-2] master.AssignmentManager: 
> based on AM, current 
> region=hbase:meta,,1.1588230740 is on server=null server being checked: 
> 10.0.0.149,16020,1443507203340
> 2015-09-29 06:18:29,451 INFO  [MASTER_META_SERVER_OPERATIONS-
> 10.0.0.148:16000-2] handler.MetaServerShutdownHandler: META has been 
> assigned to otherwhere, skip assigning.
> 2015-09-29 06:18:29,452 DEBUG 
> [MASTER_META_SERVER_OPERATIONS-10.0.0.148:16000-2] 
> master.DeadServer: Finished processing 10.0.0.149,16020,1443507203340
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
