[
https://issues.apache.org/jira/browse/HBASE-20829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529678#comment-16529678
]
Duo Zhang commented on HBASE-20829:
-----------------------------------
The error is
{noformat}
2018-07-02 07:45:05,444 ERROR [Time-limited test] replication.TestSyncReplicationStandbyKillRS(93): Failed to transit standby cluster to DOWNGRADE_ACTIVE
org.apache.hadoop.hbase.exceptions.TimeoutIOException: java.util.concurrent.TimeoutException: The procedure 23 is still running
    at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2156)
    at org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4019)
    at org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS.testStandbyKillRegionServer(TestSyncReplicationStandbyKillRS.java:90)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: The procedure 23 is still running
    at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3504)
    at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3425)
    at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2152)
    ... 24 more
{noformat}
Using pid=23, we can trace the execution of the procedure:
{noformat}
2018-07-02 07:34:59,809 INFO [PEWorker-9] procedure2.ProcedureExecutor$WorkerThread(1763): ASSERT pid=29
java.lang.AssertionError: expected to add a child in the front
    at org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.doAdd(MasterProcedureScheduler.java:152)
    at org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.enqueue(MasterProcedureScheduler.java:133)
    at org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.push(AbstractProcedureScheduler.java:115)
    at org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.yield(MasterProcedureScheduler.java:120)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1486)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1241)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1761)
2018-07-02 07:34:59,810 WARN [PEWorker-9] procedure2.ProcedureExecutor$WorkerThread(1776): Worker terminating UNNATURALLY null
java.lang.AssertionError: expected to add a child in the front
    at org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.doAdd(MasterProcedureScheduler.java:152)
    at org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.enqueue(MasterProcedureScheduler.java:133)
    at org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.push(AbstractProcedureScheduler.java:115)
    at org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.yield(MasterProcedureScheduler.java:120)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1486)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1241)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1761)
{noformat}
Let me dig more.
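For anyone repeating this kind of trace on the attached test output, here is a minimal sketch of pulling every log record that mentions a given pid. It is only an illustration: the helper names `split_records` and `records_for_pid` are made up for this example, and it assumes each record starts with a timestamp in the format seen above, with wrapped text and stack frames on the following lines. How pid=23 and pid=29 relate is not established in this comment, so each pid has to be queried separately.

```python
import re

# Assumption: a new log record starts with a timestamp such as
# "2018-07-02 07:34:59,809 "; every other line continues the previous record.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def split_records(lines):
    """Group raw log lines into records (lists of lines)."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            records.append([line])
        else:
            # Wrapped message text or a stack frame: attach to the current record.
            records[-1].append(line)
    return records


def records_for_pid(lines, pid):
    """Return the records that mention pid=<pid> as a whole token."""
    token = re.compile(r"\bpid=%d\b" % pid)
    return [
        record
        for record in split_records(lines)
        if any(token.search(line) for line in record)
    ]
```

Typical use would be to open the test output file, pass the file object as `lines`, and print the returned records for the pid of interest. The word-boundary match keeps a search for pid=2 from also returning pid=29.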
> TestSyncReplicationStandbyKillRS is flakey
> ------------------------------------------
>
> Key: HBASE-20829
> URL: https://issues.apache.org/jira/browse/HBASE-20829
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-20829-debug.patch,
> org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS-output.txt
>
>
> Timed out.
> {noformat}
> 2018-06-30 01:32:33,823 ERROR [Time-limited test] replication.TestSyncReplicationStandbyKillRS(93): Failed to transit standby cluster to DOWNGRADE_ACTIVE
> {noformat}
> We failed to transit the state to DA (DOWNGRADE_ACTIVE) and then waited for
> it to become DA, so the test hangs there.