[
https://issues.apache.org/jira/browse/HBASE-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack resolved HBASE-22652.
---------------------------
Resolution: Fixed
Pushed amendment to 2.0+.
If think we should make change in lockmanager [~Apache9], lets open new issue
to dig in on it. Thanks.
> Flakey TestLockManager; test timed out after 780 seconds
> --------------------------------------------------------
>
> Key: HBASE-22652
> URL: https://issues.apache.org/jira/browse/HBASE-22652
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.1.5
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 3.0.0, 2.3.0, 2.0.6, 2.2.1, 2.1.6
>
> Attachments:
> 0001-HBASE-22652-Flakey-TestLockManager-test-timed-out-af.patch,
> HBASE-22652.branch-2.1.001.patch
>
>
> In nightly 1318 on branch-2.1, TestLockManager failed. It was only fail. It
> fails on occasion. In this instance, test failed like this:
> {code}
> Error Message
> test timed out after 780 seconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 780
> seconds
> at
> org.apache.hadoop.hbase.master.locking.TestLockManager.tearDown(TestLockManager.java:105)
> {code}
> Digging in log, before the timeout in teardown, you see this:
> {code}
> 2019-07-01 22:45:57,849 INFO [PEWorker-1]
> procedure2.ProcedureExecutor$WorkerThread(2041): ASSERT pid=11
> java.lang.AssertionError
> at
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.add(TimeoutExecutorThread.java:81)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
> {code}
> i.e. an assert trips. The assert is doing this:
> {code}
> assert procedure.getState() == ProcedureState.WAITING_TIMEOUT;
> {code}
> The procedure is expected to be in WAITING_TIMEOUT state when we go to add it
> to the timeout executor. In the log snippet above, we see the thread is in
> RUNNABLE state. But even on successful run it is in RUNNABLE state. So what
> is going on.
> I think I figured it accidentally.
> I'd added a pause at the assert and when the assert ran, procedure was in
> RUNNABLE state and it would fail every time. So, it is a timing issue.
> I added logging of the setting of state on the procedure. There weren't many
> procedures so was easy to follow. Here is what I saw (log included emission
> of stack trace):
> {code}
> 2019-07-02 19:50:14,941 DEBUG [PEWorker-1] locking.LockProcedure(313): LOCKED
> pid=11, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure,
> namespace=namespace, type=EXCLUSIVE
> 2019-07-02 19:50:14,943 INFO [PEWorker-1] procedure2.Procedure(793): SET
> STATE ON pid=11, state=WAITING_TIMEOUT, locked=true;
> org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace,
> type=EXCLUSIVE
> java.lang.Exception
> at
> org.apache.hadoop.hbase.procedure2.Procedure.setState(Procedure.java:793)
> at
> org.apache.hadoop.hbase.master.locking.LockProcedure.execute(LockProcedure.java:249)
> at
> org.apache.hadoop.hbase.master.locking.LockProcedure.execute(LockProcedure.java:56)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:966)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1723)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
> 2019-07-02 19:50:14,944 INFO [Time-limited test] procedure2.Procedure(793):
> SET STATE ON pid=11, state=RUNNABLE, locked=true;
> org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace,
> type=EXCLUSIVE
> java.lang.Exception
> at
> org.apache.hadoop.hbase.procedure2.Procedure.setState(Procedure.java:793)
> at
> org.apache.hadoop.hbase.master.locking.LockProcedure.unlock(LockProcedure.java:225)
> at
> org.apache.hadoop.hbase.master.locking.LockManager$MasterLock.release(LockManager.java:185)
> at
> org.apache.hadoop.hbase.master.locking.TestLockManager.testMasterLockAcquire(TestLockManager.java:124)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
> at java.util.concurrent.FutureTask.run(FutureTask.java)
> at java.lang.Thread.run(Thread.java:748)
> 2019-07-02 19:50:15,037 DEBUG [Time-limited test]
> procedure.MasterProcedureScheduler(356): Add TableQueue(hbase:namespace,
> xlock=false sharedLock=1 size=1) to run queue because: pid=11,
> state=RUNNABLE, locked=true;
> org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace,
> type=EXCLUSIVE has lock
> 2019-07-02 19:53:37,062 WARN [JvmPauseMonitor]
> util.JvmPauseMonitor$Monitor(176): Detected pause in JVM or host machine (eg
> GC): pause of approximately 201656ms
> No GCs detected
> 2019-07-02 19:53:37,063 WARN [ProcExecTimeout]
> procedure2.ProcedureExecutor$WorkerMonitor(2147): Worker stuck
> PEWorker-1(pid=11), run time 3mins, 22.122sec
> 2019-07-02 19:53:37,063 WARN [RS:0;localhost:57336] util.Sleeper(86): We
> slept 202150ms instead of 1000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2019-07-02 19:53:37,064 INFO [master/localhost:0.splitLogManager..Chore.1]
> hbase.ScheduledChore(176): Chore: SplitLogManager Timeout Monitor missed its
> start time
> Disconnected from the target VM, address: '127.0.0.1:57316', transport:
> 'socket'
> 2019-07-02 19:53:37,063 INFO [regionserver/localhost:0.Chore.1]
> hbase.ScheduledChore(176): Chore: MemstoreFlusherChore missed its start time
> 2019-07-02 19:53:37,069 INFO [PEWorker-1]
> procedure2.ProcedureExecutor$WorkerThread(2041): ASSERT pid=11
> java.lang.AssertionError
> at
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.add(TimeoutExecutorThread.java:81)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
> {code}
> So, the worker thread goes to execute the lock procedure. It sets the state
> to WAITING_TIMEOUT... My break point trips and it holds up the timeout
> executor. Meantime the main test thread continues and releases lock which
> sets the procedure state back to RUNNABLE. I let the break point go and the
> worker thread finds the procedure state is RUNNABLE and the assertion trips.
> I think the assert should tolerate RUNNABLE state too.
> Looking at all ways into the add to timeout executor, all test in caller in
> one way or another for WAITING_TIMEOUT except for case where we are loading
> procedures from stores. This latter is a little unpredictable. The other
> entrances will have seen the state set to WAITING_TIMEOUT at least just
> before we get to the assert point.
> Let me relax the assert some.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)