[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting

Jean-Daniel Cryans (JIRA) Fri, 25 Jun 2010 10:14:19 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882638#action_12882638
 ]


Jean-Daniel Cryans commented on HBASE-2707:
-------------------------------------------

So actually, the code of process() looks like this:

{code}
LOG.info("Log split complete, meta reassignment and scanning:");
    if (this.isRootServer) {
      LOG.info("ProcessServerShutdown reassigning ROOT region");
      master.getRegionManager().reassignRootRegion();
      isRootServer = false;  // prevent double reassignment... heh.
    }

    for (MetaRegion metaRegion : metaRegions) {
      LOG.info("ProcessServerShutdown setting to unassigned: " +
        metaRegion.toString());
      master.getRegionManager().setUnassigned(metaRegion.getRegionInfo(), true);
    }
    // one the meta regions are online, "forget" about them.  Since there are
    // explicit checks below to make sure meta/root are online, this is likely
    // to occur.
    metaRegions.clear();

    if (!rootAvailable()) {
      // Return true so that worker does not put this request back on the
      // toDoQueue.
      // rootAvailable() has already put it on the delayedToDoQueue
      return true;
    }

    if (!rootRescanned) {
      // Scan the ROOT region
      Boolean result = new ScanRootRegion(
          new MetaRegion(master.getRegionManager().getRootRegionLocation(),
              HRegionInfo.ROOT_REGIONINFO), this.master).doWithRetries();
      if (result == null) {
        // Master is closing - give up
        return true;
      }

      if (LOG.isDebugEnabled()) {
        LOG.debug("Process server shutdown scanning root region on " +
          master.getRegionManager().getRootRegionLocation().getBindAddress() +
          " finished " + Thread.currentThread().getName());
      }
      rootRescanned = true;
    }
{code}

So if the dead RS was holding -ROOT-, the region is reassigned right away, and 
the method then returns early if !rootAvailable(). Later, when the operation is 
processed again and -ROOT- has been assigned, ProcessServerShutdown finishes 
its job. This is how the code you pasted succeeds.
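
To make that two-pass flow easier to follow, here is a minimal sketch of it. 
It is not the HBase code: ShutdownSketch, Master, ShutdownOp and heartbeat() 
are hypothetical stand-ins, and the "finished" flag stands in for the ROOT 
rescan and everything after it. It only assumes what the snippet above shows: 
ROOT is reassigned first, and the operation lands on the delayed queue while 
ROOT is not yet available.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ShutdownSketch {

    /** Hypothetical stand-in for the master's view of -ROOT-. */
    static class Master {
        boolean rootOnline = false;
        boolean rootAssignmentRequested = false;
        final Queue<ShutdownOp> delayedToDoQueue = new ArrayDeque<>();

        void reassignRootRegion() {
            rootAssignmentRequested = true;
        }

        /** Assumption: a live region server picks up the pending assignment. */
        void heartbeat() {
            if (rootAssignmentRequested) {
                rootOnline = true;
            }
        }
    }

    /** Hypothetical stand-in for ProcessServerShutdown after the log split. */
    static class ShutdownOp {
        final Master master;
        boolean isRootServer;
        boolean finished = false;

        ShutdownOp(Master master, boolean isRootServer) {
            this.master = master;
            this.isRootServer = isRootServer;
        }

        /** Returns true when the worker must not requeue the operation itself. */
        boolean process() {
            if (isRootServer) {
                master.reassignRootRegion();
                isRootServer = false; // prevent double reassignment
            }
            if (!master.rootOnline) {
                // Mirrors rootAvailable(): park the op on the delayed queue.
                master.delayedToDoQueue.add(this);
                return true;
            }
            finished = true; // stands in for the ROOT rescan and remaining work
            return true;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        ShutdownOp op = new ShutdownOp(master, true);

        op.process(); // first pass: ROOT reassigned, op parked
        System.out.println("first pass finished=" + op.finished);

        master.heartbeat(); // ROOT comes online
        ShutdownOp retry = master.delayedToDoQueue.poll();
        retry.process(); // second pass: op completes
        System.out.println("second pass finished=" + op.finished);
    }
}
```

The point of the sketch is the ordering: the reassignment request is issued 
before the early return, so the second pass has a chance to find -ROOT- 
online. If an exception is thrown before that request, as in the log split 
failure described in the issue, nothing ever brings -ROOT- back.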

> Can't recover from a dead ROOT server if any exceptions happens during log 
> splitting
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-2707
>                 URL: https://issues.apache.org/jira/browse/HBASE-2707
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2707.patch
>
>
> There's an almost easy way to get stuck after a RS holding ROOT dies, usually 
> from a GC-like event. It happens frequently to my TestReplication in 
> HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. 
> Removing old log dir 
> hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN  [master] 
> master.RegionServerOperationQueue(183): Failed processing: 
> ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed 
> todo queue
> java.io.IOException: Cannot delete: 
> hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
>         at 
> org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
>         at 
> org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: 
> /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] 
> master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
> delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] 
> master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
> delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] 
> master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average 
> load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] 
> master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
> delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] 
> master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
> delayedToDoQueue items
> {code}
> The last lines are my own debug. Since we don't process the delayed todo if 
> ROOT isn't online, we'll never reassign the regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
