[ https://issues.apache.org/jira/browse/HBASE-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439167#comment-16439167 ]

Chia-Ping Tsai commented on HBASE-20169:
----------------------------------------

{quote}Can you please explain the fix? And what's the actual stack trace you
saw? The stack trace is always incomplete on Jenkins.
{quote}
Pardon me, I only added some logging to trace the null object, so I have no full
stack trace to attach here.

The problem is a race between ProcedureExecutor#stop and
ProcedureExecutor#join.

*first thread*: When shutting down the mini cluster, Master#shutdown is
executed.
{code:java}
activeMaster.master.shutdown();{code}
ServerManager#shutdownCluster is then called. Because the RS that failed to
instantiate its coprocessor is already dead, onlineServers.isEmpty() is true, so
master.stop(...) is called. That makes Master#run leave its loop (*second thread*).
{code:java}
public void shutdownCluster() {
  String statusStr = "Cluster shutdown requested of master=" + this.master.getServerName();
  LOG.info(statusStr);
  this.clusterShutdown.set(true);
  if (onlineServers.isEmpty()) {
    // we do not synchronize here so this may cause a double stop, but not a big deal
    master.stop("OnlineServer=0 right after cluster shutdown set");
  }
}{code}
Since ProcedureExecutor#join sets timeoutExecutor to null, an NPE occurs if
ProcedureExecutor#stop reaches timeoutExecutor.sendStopSignal() after
ProcedureExecutor#join has already cleared the field.
{code:java}
public void join() {
  assert !isRunning() : "expected not running";

  // stop the timeout executor
  timeoutExecutor.awaitTermination();
  timeoutExecutor = null;
  // ... (rest of join() omitted)
}{code}
{code:java}
public void stop() {
  if (!running.getAndSet(false)) {
    return;
  }

  LOG.info("Stopping");
  scheduler.stop();
  timeoutExecutor.sendStopSignal();
}{code}
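To make the window concrete, here is a minimal single-threaded sketch. It assumes hypothetical classes ToyExecutor and ToyTimeoutExecutor that only mirror the shape of the quoted stop()/join() excerpts (they are not HBase classes). The second thread is modeled by a hypothetical "window" Runnable that runs between stop()'s flag flip and its dereference of timeoutExecutor. The sketch also shows one possible defensive pattern: read the field once into a local and skip the signal if it is already null. This is an illustration only, not necessarily what the attached patches do.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Single-threaded model of the stop()/join() interleaving described above.
 * The "window" Runnable stands in for the other thread being scheduled between
 * stop()'s running-flag flip and its use of timeoutExecutor.
 * All names here are hypothetical, not HBase APIs.
 */
public class StopJoinRaceDemo {

  static final class ToyTimeoutExecutor {
    boolean stopSignalled = false;
    void sendStopSignal() { stopSignalled = true; }
    void awaitTermination() { /* no real worker thread in this model */ }
  }

  static final class ToyExecutor {
    final AtomicBoolean running = new AtomicBoolean(true);
    volatile ToyTimeoutExecutor timeoutExecutor = new ToyTimeoutExecutor();

    // Like the quoted join(), but with an explicit check instead of `assert`
    // so the model behaves the same with or without -ea.
    void join() {
      if (running.get()) {
        throw new IllegalStateException("expected not running");
      }
      timeoutExecutor.awaitTermination();
      timeoutExecutor = null;
    }

    // Same shape as the quoted stop(): dereferences timeoutExecutor unconditionally.
    void unsafeStop(Runnable window) {
      if (!running.getAndSet(false)) {
        return;
      }
      window.run(); // the other thread may run join() right here
      timeoutExecutor.sendStopSignal(); // NPE if join() already cleared the field
    }

    // Defensive variant: read the field once and tolerate it being cleared.
    void defensiveStop(Runnable window) {
      if (!running.getAndSet(false)) {
        return;
      }
      window.run();
      ToyTimeoutExecutor te = timeoutExecutor;
      if (te != null) {
        te.sendStopSignal();
      }
    }
  }

  /** Returns true if unsafeStop() throws an NPE when join() runs inside its window. */
  static boolean unsafeStopThrows() {
    ToyExecutor ex = new ToyExecutor();
    try {
      ex.unsafeStop(ex::join);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  /** Returns true if defensiveStop() throws an NPE when join() runs inside its window. */
  static boolean defensiveStopThrows() {
    ToyExecutor ex = new ToyExecutor();
    try {
      ex.defensiveStop(ex::join);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("unsafe stop throws NPE: " + unsafeStopThrows());
    System.out.println("defensive stop throws NPE: " + defensiveStopThrows());
  }
}
```

Running main prints "unsafe stop throws NPE: true" followed by "defensive stop throws NPE: false". The local-variable read matters. Checking the field for null and then dereferencing it a second time would reopen the same window.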
 

 

> NPE when calling HBTU.shutdownMiniCluster (TestAssignmentManagerMetrics is flakey)
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-20169
>                 URL: https://issues.apache.org/jira/browse/HBASE-20169
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Major
>         Attachments: HBASE-20169.branch-2.001.patch, 
> HBASE-20169.branch-2.002.patch, HBASE-20169.branch-2.003.patch, 
> HBASE-20169.branch-2.004.patch, HBASE-20169.branch-2.005.patch, 
> HBASE-20169.v0.addendum.patch
>
>
> This usually happens when some master or RS has already gone down before we
> call shutdownMiniCluster.
> See
> https://builds.apache.org/job/HBASE-Flaky-Tests/27223/testReport/junit/org.apache.hadoop.hbase.master/TestAssignmentManagerMetrics/org_apache_hadoop_hbase_master_TestAssignmentManagerMetrics/
> and also
> http://104.198.223.121:8080/job/HBASE-Flaky-Tests/34873/testReport/junit/org.apache.hadoop.hbase.master/TestRestartCluster/testRetainAssignmentOnRestart/
> {noformat}
> java.lang.NullPointerException
>       at org.apache.hadoop.hbase.master.TestAssignmentManagerMetrics.after(TestAssignmentManagerMetrics.java:100)
> java.lang.NullPointerException
>       at org.apache.hadoop.hbase.master.TestRestartCluster.testRetainAssignmentOnRestart(TestRestartCluster.java:156)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
