[jira] [Commented] (HBASE-21518) TestMasterFailoverWithProcedures is flaky

Peter Somogyi (JIRA) Tue, 27 Nov 2018 13:55:08 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701089#comment-16701089
 ]


Peter Somogyi commented on HBASE-21518:
---------------------------------------

I think I know what is happening here. After testDeleteWithFailover succeeds 
teardown starts and shuts down cluster.
 2018-11-26 08:48:11,860 DEBUG [Time-limited test] util.JVMClusterUtil(247): 
Shutting down HBase Cluster

RegionServer and master is closed.
{noformat}
2018-11-26 08:48:13,003 INFO  [RS:0;asf901:44139] 
regionserver.HRegionServer(1134): Exiting; 
stopping=asf901.gq1.ygridcore.net,44139,1543222082929; zookeeper connection 
closed.
2018-11-26 08:48:13,004 INFO  [RegionServerTracker-0] 
master.RegionServerTracker(172): RegionServer ephemeral node deleted, 
processing expiration [asf901.gq1.ygridcore.net,44139,1543222082929]
2018-11-26 08:48:13,005 DEBUG [RegionServerTracker-0] master.DeadServer(134): 
Added asf901.gq1.ygridcore.net,44139,1543222082929; numProcessing=1
2018-11-26 08:48:13,005 INFO  [Shutdown of 
org.apache.hadoop.hbase.fs.HFileSystem@3e6c0222] 
hbase.MiniHBaseCluster$SingleFileSystemShutdownThread(221): Hook closing 
fs=org.apache.hadoop.hbase.fs.HFileSystem@3e6c0222
2018-11-26 08:48:13,005 INFO  [RegionServerTracker-0] 
master.ServerManager(585): Processing expiration of 
asf901.gq1.ygridcore.net,44139,1543222082929 on 
asf901.gq1.ygridcore.net,45952,1543222082842
2018-11-26 08:48:13,005 INFO  [Time-limited test] util.JVMClusterUtil(335): 
Shutdown of 2 master(s) and 1 regionserver(s) complete
{noformat}
New ServerCrashProcedure added basically at the same time.
{noformat}
2018-11-26 08:48:13,016 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor(1092): Stored pid=12, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=asf901.gq1.ygridcore.net,44139,1543222082929, splitWal=true, meta=true
2018-11-26 08:48:13,017 DEBUG [RegionServerTracker-0] 
procedure.MasterProcedureScheduler(339): Add 
ServerQueue(asf901.gq1.ygridcore.net,44139,1543222082929, xlock=false 
sharedLock=0 size=1) to run queue because: the exclusive lock is not held by 
anyone when adding pid=12, state=RUNNABLE:SERVER_CRASH_START; 
ServerCrashProcedure server=asf901.gq1.ygridcore.net,44139,1543222082929, 
splitWal=true, meta=true
2018-11-26 08:48:13,017 DEBUG [RegionServerTracker-0] 
assignment.AssignmentManager(1320): 
Added=asf901.gq1.ygridcore.net,44139,1543222082929 to dead servers, submitted 
shutdown handler to be executed meta=true
2018-11-26 08:48:13,019 DEBUG [PEWorker-1] 
procedure.MasterProcedureScheduler(349): Remove 
ServerQueue(asf901.gq1.ygridcore.net,44139,1543222082929, xlock=false 
sharedLock=0 size=0) from run queue because: queue is empty after polling out 
pid=12, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=asf901.gq1.ygridcore.net,44139,1543222082929, splitWal=true, meta=true
{noformat}
Sub procedure started to place meta table but there is no server running, this 
procedure stuck and the teardown retries this until test timeout.
{noformat}
2018-11-26 08:48:13,041 INFO  [PEWorker-1] assignment.AssignProcedure(254): 
Starting pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; 
AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, 
location=asf901.gq1.ygridcore.net,44139,1543222082929; forceNewPlan=false, 
retain=true
2018-11-26 08:48:13,042 DEBUG [PEWorker-1] procedure2.RootProcedureState(153): 
Add procedure pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, 
locked=true; AssignProcedure table=hbase:meta, region=1588230740 as the 3th 
rollback step
2018-11-26 08:48:13,192 WARN  [master/asf901:0] 
assignment.AssignmentManager(1670): No servers available; cannot place 1 
unassigned regions.
{noformat}

> TestMasterFailoverWithProcedures is flaky
> -----------------------------------------
>
>                 Key: HBASE-21518
>                 URL: https://issues.apache.org/jira/browse/HBASE-21518
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.0, 2.0.3, 2.1.2
>            Reporter: Peter Somogyi
>            Priority: Major
>         Attachments: output.txt
>
>
> TestMasterFailoverWithProcedures test is failing frequently, times out. I 
> faced this failure on 2.0.3RC0 vote and it also appears on multiple flaky 
> dashboards.
> branch-2: 
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2/2007/]
> branch-2.1: 
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.1/2002/]
> branch-2.0: 
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Flaky-Tests/job/branch-2.0/1988/]
>   
> {noformat}
> [INFO] Running 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 
> 780.648 s <<< FAILURE! - in 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures
> [ERROR] 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures  
> Time elapsed: 749.024 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 780 
> seconds
>       at 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures.tearDown(TestMasterFailoverWithProcedures.java:86)
> [ERROR] 
> org.apache.hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures  
> Time elapsed: 749.051 s  <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-3-2
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21518) TestMasterFailoverWithProcedures is flaky

Reply via email to