[
https://issues.apache.org/jira/browse/HBASE-21518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703332#comment-16703332
]
Peter Somogyi commented on HBASE-21518:
---------------------------------------
ServerManager#expireServer checks whether the cluster is shutting down and, if
so, does not create a ServerCrashProcedure for dead servers. The AtomicBoolean
clusterShutdown flag is set to true when ServerManager#shutdownCluster is
called. However, there are 2 ServerManager instances, and expireServer checks
a different instance from the one on which the flag was set.
I added some debug logs with hashcodes. They show that clusterShutdown was set
to true on one instance (hash=1980707837), but later during shutdown the
instance that is checked (hash=416244779) still contains false. That is why a
ServerCrashProcedure is created, which then hangs since there is no Master.
{noformat}
2018-11-29 15:43:21,948 INFO [Thread-81] master.ServerManager(160): ServerManager initialized. clusterShutdown false, thread 210, hash 1980707837
2018-11-29 15:43:23,929 INFO [Thread-81] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 210, hash=1980707837
2018-11-29 15:43:23,929 INFO [Thread-81] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 210, hash=1980707837
2018-11-29 15:43:29,732 INFO [Thread-80] master.ServerManager(160): ServerManager initialized. clusterShutdown false, thread 209, hash 416244779
2018-11-29 15:43:29,820 INFO [Thread-80] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 209, hash=416244779
2018-11-29 15:43:29,820 INFO [Thread-80] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 209, hash=416244779
2018-11-29 15:43:29,937 INFO [Time-limited test] master.ServerManager(904): Set clusterShutdown to true, thread 14, hash 1980707837
2018-11-29 15:43:30,985 INFO [RegionServerTracker-0] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 461, hash=416244779
2018-11-29 15:43:30,986 INFO [RegionServerTracker-0] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 461, hash=416244779
2018-11-29 15:48:29,851 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:48:29,852 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:48:29,852 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
2018-11-29 15:53:32,277 INFO [master/172.30.65.195:0.Chore.1] master.ServerManager(913): Called isClusterShutdown. clusterShutdown=false, thread 417, hash=416244779
{noformat}
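To illustrate the behaviour seen in the logs, here is a minimal standalone
sketch. It does not use the real HBase classes: ShutdownFlagSketch and
FakeServerManager are hypothetical names, the boolean return value of
expireServer is a simplification, and System.identityHashCode is only assumed
to be comparable to the hash printed in my debug logs. It shows that a
per-instance AtomicBoolean set on one object is not visible through another.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative sketch only; FakeServerManager is NOT the HBase ServerManager. */
public class ShutdownFlagSketch {

  /** Hypothetical stand-in that keeps a per-instance shutdown flag. */
  public static final class FakeServerManager {
    private final AtomicBoolean clusterShutdown = new AtomicBoolean(false);

    public void shutdownCluster() {
      clusterShutdown.set(true);
    }

    public boolean isClusterShutdown() {
      return clusterShutdown.get();
    }

    /**
     * Simplified: returns true if a crash procedure would be scheduled,
     * false if it is skipped because this instance sees a cluster shutdown.
     */
    public boolean expireServer(String serverName) {
      return !isClusterShutdown();
    }
  }

  public static void main(String[] args) {
    FakeServerManager first = new FakeServerManager();
    FakeServerManager second = new FakeServerManager();

    // The shutdown is recorded on the first instance only.
    first.shutdownCluster();

    // The second instance never sees the flag, so it still schedules a procedure.
    System.out.println("first:  clusterShutdown=" + first.isClusterShutdown()
        + ", hash=" + System.identityHashCode(first));
    System.out.println("second: clusterShutdown=" + second.isClusterShutdown()
        + ", hash=" + System.identityHashCode(second));
    System.out.println("second schedules crash procedure: "
        + second.expireServer("rs-1"));
  }
}
```

In this sketch the flag is an instance field, so whichever instance handles
expireServer has to be the same one that received shutdownCluster for the
check to work.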
> TestMasterFailoverWithProcedures is flaky
> -----------------------------------------
>
> Key: HBASE-21518
> URL: https://issues.apache.org/jira/browse/HBASE-21518
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.2.0, 2.0.3, 2.1.2
> Reporter: Peter Somogyi
> Assignee: Peter Somogyi
> Priority: Major
> Attachments: output.txt
>
>
> TestMasterFailoverWithProcedures is failing frequently with timeouts. I
> hit this failure during the 2.0.3RC0 vote, and it also appears on multiple
> flaky dashboards.
> branch-2:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2/2007/]
> branch-2.1:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.1/2002/]
> branch-2.0:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.0/1988/]
>
> {noformat}
> [INFO] Running org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 780.648 s <<< FAILURE! - in org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures Time elapsed: 749.024 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 780 seconds
> at org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures.tearDown(TestMasterFailoverWithProcedures.java:86)
> [ERROR] org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures Time elapsed: 749.051 s <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-3-2
> {noformat}