[
https://issues.apache.org/jira/browse/HBASE-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578298#comment-14578298
]
Jerry He commented on HBASE-13845:
----------------------------------
ok. The master branch has gone fully ZK-less for region assignment, and the
problem does not show up under ZK-less assignment.
We always call am.regionOffline(HRegionInfo.FIRST_META_REGIONINFO) before
verifyAndAssignMetaWithRetries() when processing the meta server shutdown/crash:
{code}
// Assign meta if we were carrying it.
// Check again: region may be assigned to other where because of RIT
// timeout
if (am.isCarryingMeta(serverName)) {
  LOG.info("Server " + serverName + " was carrying META. Trying to assign.");
  am.regionOffline(HRegionInfo.FIRST_META_REGIONINFO);
  verifyAndAssignMetaWithRetries();
{code}
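For reference, verifyAndAssignMetaWithRetries() is just a retry wrapper around verifyAndAssignMeta(). Below is a simplified sketch of how it behaves; it is not the verbatim source, the retry count and pause (10 x 1000 ms by default) are taken from the issue description below, and the configuration lookups are elided:
{code}
// Simplified sketch of the retry wrapper (not verbatim MetaServerShutdownHandler source).
private void verifyAndAssignMetaWithRetries() throws IOException {
  final int retries = 10;      // assumed default retry count
  final long pauseMs = 1000L;  // assumed default pause between attempts
  for (int attempt = 0; ; attempt++) {
    try {
      verifyAndAssignMeta();   // may throw "hbase:meta is onlined on the dead server ..."
      return;
    } catch (Exception e) {
      if (attempt >= retries) {
        // Retries exhausted: the master aborts itself (see the log in the description).
        server.abort("verifyAndAssignMeta failed after " + retries + " times retries, aborting", e);
        throw new IOException("Aborting", e);
      }
      try {
        Thread.sleep(pauseMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted while retrying verifyAndAssignMeta", ie);
      }
    }
  }
}
{code}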
When using ZK-less assignment, am.regionOffline(HRegionInfo.FIRST_META_REGIONINFO)
==> RegionStateStore.updateRegionState(), which sets the OFFLINE state on the meta
region server znode:
{code}
// meta state stored in zk.
if (hri.isMetaRegion()) {
  // persist meta state in MetaTableLocator (which in turn is zk storage currently)
  try {
    MetaTableLocator.setMetaLocation(server.getZooKeeper(),
      newState.getServerName(), hri.getReplicaId(), newState.getState());
    return; // Done
  }
{code}
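To make the effect of that OFFLINE write concrete, here is an illustrative read-back of the persisted meta state; the calls are modeled on the snippets quoted in this comment, and the replica id constant is only for illustration:
{code}
// Illustrative only: after regionOffline() has persisted OFFLINE for meta,
// the state read back from the meta region server znode is OFFLINE, so
// isOpened() is false.
RegionState state = MetaTableLocator.getMetaRegionState(zkw, HRegionInfo.DEFAULT_REPLICA_ID);
// state.getState() == RegionState.State.OFFLINE
// state.isOpened() == false
{code}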
Then MetaTableLocator.getMetaRegionLocation() will always return null.
{code}
public ServerName getMetaRegionLocation(final ZooKeeperWatcher zkw, int replicaId) {
  try {
    RegionState state = getMetaRegionState(zkw, replicaId);
    return state.isOpened() ? state.getServerName() : null;
  } catch (KeeperException ke) {
    return null;
  }
}
{code}
So we will never try to re-use the existing meta region server location recorded in ZK.
In the master branch code:
{code}
private void verifyAndAssignMeta(final MasterProcedureEnv env)
    throws InterruptedException, IOException, KeeperException {
  MasterServices services = env.getMasterServices();
  if (!isMetaAssignedQuickTest(env)) {
    services.getAssignmentManager().assignMeta(HRegionInfo.FIRST_META_REGIONINFO);
  } else if (serverName.equals(services.getMetaTableLocator().
      getMetaRegionLocation(services.getZooKeeper()))) {
    throw new IOException("hbase:meta is onlined on the dead server " + this.serverName);
  } else {
    LOG.info("Skip assigning hbase:meta because it is online at "
      + services.getMetaTableLocator().getMetaRegionLocation(services.getZooKeeper()));
  }
}
{code}
So isMetaAssignedQuickTest() will always return false, and the master branch simply
re-assigns meta instead of throwing the "hbase:meta is onlined on the dead server" exception.
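For reference, a paraphrased sketch of the quick test (approximate, not verbatim master-branch source; the timeout value is illustrative) shows why: it can only return true when getMetaRegionLocation() yields a location, which never happens once the meta state has been set OFFLINE.
{code}
// Paraphrased sketch of isMetaAssignedQuickTest() (approximate, not verbatim source).
private boolean isMetaAssignedQuickTest(final MasterProcedureEnv env)
    throws InterruptedException, IOException {
  ZooKeeperWatcher zkw = env.getMasterServices().getZooKeeper();
  MetaTableLocator mtl = env.getMasterServices().getMetaTableLocator();
  // isLocationAvailable() boils down to getMetaRegionLocation(zkw) != null,
  // which is always false here because the meta state was just set OFFLINE.
  if (mtl.isLocationAvailable(zkw)) {
    // Only reached when a location exists: check that meta is really serving there.
    return mtl.verifyMetaRegionLocation(env.getMasterServices().getConnection(), zkw,
        1000 /* timeout ms, illustrative */);
  }
  return false;
}
{code}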
> Expire of one region server carrying meta can bring down the master
> -------------------------------------------------------------------
>
> Key: HBASE-13845
> URL: https://issues.apache.org/jira/browse/HBASE-13845
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 2.0.0, 1.1.0, 1.2.0
> Reporter: Jerry He
> Assignee: Jerry He
> Fix For: 2.0.0, 1.2.0, 1.1.1
>
> Attachments: HBASE-13845-branch-1.1.patch
>
>
> There seems to be a code bug that can cause the expiration of one region server
> carrying meta to bring down the master in certain cases.
> Here is the sequence of events.
> a) The master detects the expiration of a region server on ZK, and starts to
> expire the region server.
> b) Since the failed region server carries meta, the shutdown handler will
> call verifyAndAssignMetaWithRetries() while processing the expired rs.
> c) In verifyAndAssignMeta(), there is logic that verifies the meta region location:
> {code}
> if (!server.getMetaTableLocator().verifyMetaRegionLocation(server.getConnection(),
>     this.server.getZooKeeper(), timeout)) {
>   this.services.getAssignmentManager().assignMeta(HRegionInfo.FIRST_META_REGIONINFO);
> } else if (serverName.equals(server.getMetaTableLocator().getMetaRegionLocation(
>     this.server.getZooKeeper()))) {
>   throw new IOException("hbase:meta is onlined on the dead server " + serverName);
> {code}
> If we see that the meta region is still alive on the expired rs, we throw an
> exception.
> We do some retries (default 10 x 1000 ms) of verifyAndAssignMeta().
> If we still get the exception after the retries, we abort the master:
> {code}
> 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:60000-0] master.HMaster: Master server abort: loaded coprocessors are: []
> 2015-05-27 06:58:30,156 FATAL [MASTER_META_SERVER_OPERATIONS-bdvs1163:60000-0] master.HMaster: verifyAndAssignMeta failed after10 times retries, aborting
> java.io.IOException: hbase:meta is onlined on the dead server bdvs1164.svl.ibm.com,16020,1432681743203
>   at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMeta(MetaServerShutdownHandler.java:162)
>   at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMetaWithRetries(MetaServerShutdownHandler.java:184)
>   at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:93)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-05-27 06:58:30,156 INFO [MASTER_META_SERVER_OPERATIONS-bdvs1163:60000-0] regionserver.HRegionServer: STOPPED: verifyAndAssignMeta failed after10 times retries, aborting
> {code}
> The problem happens when the expired region server is slow processing its own
> expiration or has a slow death, and is still able to respond to the master's
> meta verification in the meantime.