[ https://issues.apache.org/jira/browse/HBASE-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439167#comment-16439167 ]
Chia-Ping Tsai commented on HBASE-20169:
----------------------------------------

{quote}Can you please explain the fix? And what's the actually stack trace you saw? The stack trace is always incomplete on jenkins.
{quote}
Pardon me, I only added some logging to trace the null object, so no full stack trace can be attached here.

The story is about thread contention between ProcedureExecutor#stop and ProcedureExecutor#join.

*first thread*: When shutting down the mini cluster, master#shutdown is executed.
{code:java}
activeMaster.master.shutdown();{code}
Then ServerManager#shutdownCluster is called. Because the rs which failed to instantiate the cp is dead, onlineServers.isEmpty() is true. That makes Master#run leave the loop. (*second thread*)
{code:java}
public void shutdownCluster() {
  String statusStr = "Cluster shutdown requested of master=" + this.master.getServerName();
  LOG.info(statusStr);
  this.clusterShutdown.set(true);
  if (onlineServers.isEmpty()) {
    // we do not synchronize here so this may cause a double stop, but not a big deal
    master.stop("OnlineServer=0 right after cluster shutdown set");
  }
}{code}
Since ProcedureExecutor#join sets timeoutExecutor to null, an NPE happens if ProcedureExecutor#stop is called after ProcedureExecutor#join.
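To make the ordering concrete, here is a minimal standalone sketch of the same stop/join shape. It is not HBase code: StopJoinSketch, TimeoutWorker and stopGuarded are hypothetical names, the sketch drops the assert from join, and the null check in stopGuarded is only one assumed way to tolerate the ordering, not necessarily what the attached patches do.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for ProcedureExecutor; none of this is HBase code. */
class StopJoinSketch {
    /** Hypothetical stand-in for the timeout executor. */
    static class TimeoutWorker {
        void sendStopSignal() { /* no-op in this sketch */ }
        void awaitTermination() { /* no-op in this sketch */ }
    }

    private final AtomicBoolean running = new AtomicBoolean(true);
    private volatile TimeoutWorker timeoutExecutor = new TimeoutWorker();

    /** Same shape as the excerpted stop(): dereferences the field directly. */
    void stop() {
        if (!running.getAndSet(false)) {
            return;
        }
        timeoutExecutor.sendStopSignal(); // NPE if join() already nulled the field
    }

    /** Assumed mitigation: read the field once and skip the call when it is null. */
    void stopGuarded() {
        if (!running.getAndSet(false)) {
            return;
        }
        TimeoutWorker worker = timeoutExecutor;
        if (worker != null) {
            worker.sendStopSignal();
        }
    }

    /** Same shape as the excerpted join(), minus the assert. */
    void join() {
        timeoutExecutor.awaitTermination();
        timeoutExecutor = null;
    }

    /** Returns true if calling stop() after join() throws an NPE. */
    static boolean unguardedStopThrows() {
        StopJoinSketch executor = new StopJoinSketch();
        executor.join();
        try {
            executor.stop();
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    /** Returns true if the guarded variant survives the same ordering. */
    static boolean guardedStopSurvives() {
        StopJoinSketch executor = new StopJoinSketch();
        executor.join();
        executor.stopGuarded();
        return true;
    }
}
```

The sketch runs join and stop on one thread in the bad order, so the failure is deterministic rather than timing dependent. The actual ProcedureExecutor methods are excerpted next.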
{code:java}
public void join() {
  assert !isRunning() : "expected not running";

  // stop the timeout executor
  timeoutExecutor.awaitTermination();
  timeoutExecutor = null;{code}
{code:java}
public void stop() {
  if (!running.getAndSet(false)) {
    return;
  }

  LOG.info("Stopping");
  scheduler.stop();
  timeoutExecutor.sendStopSignal();
}{code}

> NPE when calling HBTU.shutdownMiniCluster (TestAssignmentManagerMetrics is flakey)
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-20169
>                 URL: https://issues.apache.org/jira/browse/HBASE-20169
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Major
>         Attachments: HBASE-20169.branch-2.001.patch, HBASE-20169.branch-2.002.patch, HBASE-20169.branch-2.003.patch, HBASE-20169.branch-2.004.patch, HBASE-20169.branch-2.005.patch, HBASE-20169.v0.addendum.patch
>
>
> This usually happens when some master or rs has already gone down before we call shutdownMiniCluster.
> See
> https://builds.apache.org/job/HBASE-Flaky-Tests/27223/testReport/junit/org.apache.hadoop.hbase.master/TestAssignmentManagerMetrics/org_apache_hadoop_hbase_master_TestAssignmentManagerMetrics/
> and also
> http://104.198.223.121:8080/job/HBASE-Flaky-Tests/34873/testReport/junit/org.apache.hadoop.hbase.master/TestRestartCluster/testRetainAssignmentOnRestart/
> {noformat}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hbase.master.TestAssignmentManagerMetrics.after(TestAssignmentManagerMetrics.java:100)
> java.lang.NullPointerException
> 	at org.apache.hadoop.hbase.master.TestRestartCluster.testRetainAssignmentOnRestart(TestRestartCluster.java:156)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)