[
https://issues.apache.org/jira/browse/HBASE-21518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701089#comment-16701089
]
Peter Somogyi commented on HBASE-21518:
---------------------------------------
I think I know what is happening here. After testDeleteWithFailover succeeds
teardown starts and shuts down cluster.
2018-11-26 08:48:11,860 DEBUG [Time-limited test] util.JVMClusterUtil(247):
Shutting down HBase Cluster
RegionServer and master is closed.
{noformat}
2018-11-26 08:48:13,003 INFO [RS:0;asf901:44139]
regionserver.HRegionServer(1134): Exiting;
stopping=asf901.gq1.ygridcore.net,44139,1543222082929; zookeeper connection
closed.
2018-11-26 08:48:13,004 INFO [RegionServerTracker-0]
master.RegionServerTracker(172): RegionServer ephemeral node deleted,
processing expiration [asf901.gq1.ygridcore.net,44139,1543222082929]
2018-11-26 08:48:13,005 DEBUG [RegionServerTracker-0] master.DeadServer(134):
Added asf901.gq1.ygridcore.net,44139,1543222082929; numProcessing=1
2018-11-26 08:48:13,005 INFO [Shutdown of
org.apache.hadoop.hbase.fs.HFileSystem@3e6c0222]
hbase.MiniHBaseCluster$SingleFileSystemShutdownThread(221): Hook closing
fs=org.apache.hadoop.hbase.fs.HFileSystem@3e6c0222
2018-11-26 08:48:13,005 INFO [RegionServerTracker-0]
master.ServerManager(585): Processing expiration of
asf901.gq1.ygridcore.net,44139,1543222082929 on
asf901.gq1.ygridcore.net,45952,1543222082842
2018-11-26 08:48:13,005 INFO [Time-limited test] util.JVMClusterUtil(335):
Shutdown of 2 master(s) and 1 regionserver(s) complete
{noformat}
New ServerCrashProcedure added basically at the same time.
{noformat}
2018-11-26 08:48:13,016 DEBUG [RegionServerTracker-0]
procedure2.ProcedureExecutor(1092): Stored pid=12,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=asf901.gq1.ygridcore.net,44139,1543222082929, splitWal=true, meta=true
2018-11-26 08:48:13,017 DEBUG [RegionServerTracker-0]
procedure.MasterProcedureScheduler(339): Add
ServerQueue(asf901.gq1.ygridcore.net,44139,1543222082929, xlock=false
sharedLock=0 size=1) to run queue because: the exclusive lock is not held by
anyone when adding pid=12, state=RUNNABLE:SERVER_CRASH_START;
ServerCrashProcedure server=asf901.gq1.ygridcore.net,44139,1543222082929,
splitWal=true, meta=true
2018-11-26 08:48:13,017 DEBUG [RegionServerTracker-0]
assignment.AssignmentManager(1320):
Added=asf901.gq1.ygridcore.net,44139,1543222082929 to dead servers, submitted
shutdown handler to be executed meta=true
2018-11-26 08:48:13,019 DEBUG [PEWorker-1]
procedure.MasterProcedureScheduler(349): Remove
ServerQueue(asf901.gq1.ygridcore.net,44139,1543222082929, xlock=false
sharedLock=0 size=0) from run queue because: queue is empty after polling out
pid=12, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=asf901.gq1.ygridcore.net,44139,1543222082929, splitWal=true, meta=true
{noformat}
Sub procedure started to place meta table but there is no server running, this
procedure stuck and the teardown retries this until test timeout.
{noformat}
2018-11-26 08:48:13,041 INFO [PEWorker-1] assignment.AssignProcedure(254):
Starting pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true;
AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE,
location=asf901.gq1.ygridcore.net,44139,1543222082929; forceNewPlan=false,
retain=true
2018-11-26 08:48:13,042 DEBUG [PEWorker-1] procedure2.RootProcedureState(153):
Add procedure pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH,
locked=true; AssignProcedure table=hbase:meta, region=1588230740 as the 3th
rollback step
2018-11-26 08:48:13,192 WARN [master/asf901:0]
assignment.AssignmentManager(1670): No servers available; cannot place 1
unassigned regions.
{noformat}
> TestMasterFailoverWithProcedures is flaky
> -----------------------------------------
>
> Key: HBASE-21518
> URL: https://issues.apache.org/jira/browse/HBASE-21518
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.2.0, 2.0.3, 2.1.2
> Reporter: Peter Somogyi
> Priority: Major
> Attachments: output.txt
>
>
> TestMasterFailoverWithProcedures test is failing frequently, times out. I
> faced this failure on 2.0.3RC0 vote and it also appears on multiple flaky
> dashboards.
> branch-2:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2/2007/]
> branch-2.1:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.1/2002/]
> branch-2.0:
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.0/1988/]
>
> {noformat}
> [INFO] Running
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed:
> 780.648 s <<< FAILURE! - in
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR]
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> Time elapsed: 749.024 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 780
> seconds
> at
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures.tearDown(TestMasterFailoverWithProcedures.java:86)
> [ERROR]
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> Time elapsed: 749.051 s <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-3-2
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)