[ 
https://issues.apache.org/jira/browse/HBASE-26468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HBASE-26468:
---------------------------------
    Description: 
We observed this in a production cluster running version 1.6.
The RS crashed for some reason, but the process kept running. On further 
debugging, we found one non-daemon thread still running, which prevented the 
RS from exiting cleanly. Our clusters are managed by Ambari, which can restart 
services automatically, but because the process was still running and the pid 
file was present, Ambari could not do much. There will always be some bug 
where we fail to stop a non-daemon thread. The shutdown hook will not be 
called unless one of the following two conditions is met:

# The program exits normally, when the last non-daemon thread exits or when the 
exit (equivalently, System.exit) method is invoked, or
# The virtual machine is terminated in response to a user interrupt, such as 
typing ^C, or a system-wide event, such as user logoff or system shutdown.
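
Because the first condition depends on every non-daemon thread exiting, one way to see what is keeping the JVM alive is to enumerate the live non-daemon threads. The following is a minimal sketch using only the JDK; the class name NonDaemonThreads and its find() method are hypothetical and not part of HBase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of HBase.
final class NonDaemonThreads {
  private NonDaemonThreads() {}

  /** Returns all live non-daemon threads other than the calling thread. */
  static List<Thread> find() {
    List<Thread> result = new ArrayList<>();
    // getAllStackTraces() is keyed by every live thread in this JVM.
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
        result.add(t);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    for (Thread t : find()) {
      System.out.println("non-daemon thread still alive: " + t.getName());
    }
  }
}
```

Any thread this reports after the server's main loop has finished is a candidate for the thread that blocks a normal exit.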

Consider the first condition: the JVM shuts down when the last non-daemon 
thread exits or when the exit method is invoked.

Below is the code snippet from 
[HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java#L51]

{code:java}
  private int start() throws Exception {
    try {
      if (LocalHBaseCluster.isLocal(conf)) {
         // Ignore this.
      } else {
        HRegionServer hrs =
          HRegionServer.constructRegionServer(regionServerClass, conf);
        hrs.start();
        hrs.join();
        if (hrs.isAborted()) {
          throw new RuntimeException("HRegionServer Aborted");
        }
      }
    } catch (Throwable t) {
      LOG.error("Region server exiting", t);
      return 1;
    }
    return 0;
  }
{code}

Within HRegionServer, there is a subtle difference between a server being 
aborted and a server being stopped. If it is stopped, isAborted returns false 
and start() returns 0.

Below is the code from 
[ServerCommandLine.java|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ServerCommandLine.java#L147]


{code:java}
  public void doMain(String args[]) {
    try {
      int ret = ToolRunner.run(HBaseConfiguration.create(), this, args);
      if (ret != 0) {
        System.exit(ret);
      }
    } catch (Exception e) {
      LOG.error("Failed to run", e);
      System.exit(-1);
    }
  }
{code}

If the return code is 0, doMain does not call System.exit. The JVM then does 
not run the shutdown hooks until all non-daemon threads have stopped, so it 
waits forever if we fail to stop every non-daemon thread cleanly.
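
One possible way to avoid the hang is to call the exit function for every return code, including 0. The sketch below only illustrates that idea and is not necessarily the fix the project adopted. ExitAlways, its doMain signature, and the injected exiter are hypothetical names, and the Callable stands in for the ToolRunner.run call.

```java
import java.util.concurrent.Callable;
import java.util.function.IntConsumer;

// Hypothetical sketch; not the HBase ServerCommandLine implementation.
final class ExitAlways {
  private ExitAlways() {}

  /**
   * Runs the tool and always hands its return code to the exiter,
   * even when the code is 0. In production the exiter would be
   * System::exit, which starts the shutdown hooks regardless of any
   * non-daemon threads that are still alive.
   */
  static void doMain(Callable<Integer> tool, IntConsumer exiter) {
    int ret;
    try {
      ret = tool.call();
    } catch (Exception e) {
      System.err.println("Failed to run: " + e);
      ret = -1;
    }
    exiter.accept(ret);
  }

  public static void main(String[] args) {
    // A recording exiter keeps this demo from terminating the JVM;
    // a real server would pass System::exit instead.
    doMain(() -> 0, code -> System.out.println("would exit with " + code));
  }
}
```

Injecting the exiter is only a convenience for testing; the point is that the exit call no longer depends on the return code being non-zero.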



  was:
We observed this in a production cluster running version 1.6.
The RS crashed for some reason, but the process kept running. On further 
debugging, we found one non-daemon thread still running, which prevented the 
RS from exiting cleanly. Our clusters are managed by Ambari, which can restart 
services automatically, but because the process was still running and the pid 
file was present, Ambari could not do much. There will always be some bug 
where we fail to stop a non-daemon thread, but there should be a maximum 
amount of time we wait before exiting.

Relevant code: 
[HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java]
{code:java}
        logProcessInfo(getConf());
        HRegionServer hrs =
          HRegionServer.constructRegionServer(regionServerClass, conf);
        hrs.start();
        hrs.join();  // This should be a timed join.
        if (hrs.isAborted()) {
          throw new RuntimeException("HRegionServer Aborted");
        }
      }
{code}
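
The timed join suggested in the snippet above could look roughly like the following. This is a minimal sketch against a plain java.lang.Thread; TimedJoin, joinWithTimeout, and the timeout values are hypothetical, and this is not how hrs.join() is implemented in HBase.

```java
// Hypothetical sketch; not part of HBase.
final class TimedJoin {
  private TimedJoin() {}

  /**
   * Waits up to timeoutMillis for t to finish.
   * Returns true if t has terminated, false if it is still alive.
   * Note that Thread.join treats a timeout of 0 as "wait forever".
   */
  static boolean joinWithTimeout(Thread t, long timeoutMillis) {
    try {
      t.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    }
    return !t.isAlive();
  }

  public static void main(String[] args) {
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException ignored) {
        // exit early if interrupted
      }
    }, "worker");
    worker.start();
    boolean done = joinWithTimeout(worker, 50);
    System.out.println("finished within 50 ms: " + done);
    joinWithTimeout(worker, 0); // wait for the worker before returning
  }
}
```

A caller that gets false back can then decide to log the stuck thread and exit explicitly instead of waiting indefinitely.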


> Region Server doesn't exit cleanly in case it crashes.
> ------------------------------------------------------
>
>                 Key: HBASE-26468
>                 URL: https://issues.apache.org/jira/browse/HBASE-26468
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
