[jira] [Resolved] (HBASE-22652) Flakey TestLockManager; test timed out after 780 seconds

stack (JIRA) Wed, 03 Jul 2019 07:53:10 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack resolved HBASE-22652.
---------------------------
    Resolution: Fixed

Pushed amendment to 2.0+.

If think we should make change in lockmanager [~Apache9], lets open new issue 
to dig in on it. Thanks.

> Flakey TestLockManager; test timed out after 780 seconds
> --------------------------------------------------------
>
>                 Key: HBASE-22652
>                 URL: https://issues.apache.org/jira/browse/HBASE-22652
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.1.5
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.0.6, 2.2.1, 2.1.6
>
>         Attachments: 
> 0001-HBASE-22652-Flakey-TestLockManager-test-timed-out-af.patch, 
> HBASE-22652.branch-2.1.001.patch
>
>
> In nightly 1318 on branch-2.1, TestLockManager failed. It was only fail. It 
> fails on occasion. In this instance, test failed like this:
> {code}
> Error Message
> test timed out after 780 seconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 780 
> seconds
>       at 
> org.apache.hadoop.hbase.master.locking.TestLockManager.tearDown(TestLockManager.java:105)
> {code}
> Digging in log, before the timeout in teardown, you see this:
> {code}
> 2019-07-01 22:45:57,849 INFO  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2041): ASSERT pid=11
> java.lang.AssertionError
>       at 
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.add(TimeoutExecutorThread.java:81)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
> {code}
> i.e. an assert trips. The assert is doing this:
> {code}
>     assert procedure.getState() == ProcedureState.WAITING_TIMEOUT;
> {code}
> The procedure is expected to be in WAITING_TIMEOUT state when we go to add it 
> to the timeout executor. In the log snippet above, we see the thread is in 
> RUNNABLE state. But even on successful run it is in RUNNABLE state. So what 
> is going on.
> I think I figured it accidentally.
> I'd added a pause at the assert and when the assert ran, procedure was in 
> RUNNABLE state and it would fail every time. So, it is a timing issue.
> I added logging of the setting of state on the procedure. There weren't many 
> procedures so was easy to follow. Here is what I saw (log included emission 
> of stack trace):
> {code}
> 2019-07-02 19:50:14,941 DEBUG [PEWorker-1] locking.LockProcedure(313): LOCKED 
> pid=11, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, 
> namespace=namespace, type=EXCLUSIVE
> 2019-07-02 19:50:14,943 INFO  [PEWorker-1] procedure2.Procedure(793): SET 
> STATE ON pid=11, state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, 
> type=EXCLUSIVE
> java.lang.Exception
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.setState(Procedure.java:793)
>       at 
> org.apache.hadoop.hbase.master.locking.LockProcedure.execute(LockProcedure.java:249)
>       at 
> org.apache.hadoop.hbase.master.locking.LockProcedure.execute(LockProcedure.java:56)
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:966)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1723)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
> 2019-07-02 19:50:14,944 INFO  [Time-limited test] procedure2.Procedure(793): 
> SET STATE ON pid=11, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, 
> type=EXCLUSIVE
> java.lang.Exception
>       at 
> org.apache.hadoop.hbase.procedure2.Procedure.setState(Procedure.java:793)
>       at 
> org.apache.hadoop.hbase.master.locking.LockProcedure.unlock(LockProcedure.java:225)
>       at 
> org.apache.hadoop.hbase.master.locking.LockManager$MasterLock.release(LockManager.java:185)
>       at 
> org.apache.hadoop.hbase.master.locking.TestLockManager.testMasterLockAcquire(TestLockManager.java:124)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>       at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>       at java.util.concurrent.FutureTask.run(FutureTask.java)
>       at java.lang.Thread.run(Thread.java:748)
> 2019-07-02 19:50:15,037 DEBUG [Time-limited test] 
> procedure.MasterProcedureScheduler(356): Add TableQueue(hbase:namespace, 
> xlock=false sharedLock=1 size=1) to run queue because: pid=11, 
> state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, 
> type=EXCLUSIVE has lock
> 2019-07-02 19:53:37,062 WARN  [JvmPauseMonitor] 
> util.JvmPauseMonitor$Monitor(176): Detected pause in JVM or host machine (eg 
> GC): pause of approximately 201656ms
> No GCs detected
> 2019-07-02 19:53:37,063 WARN  [ProcExecTimeout] 
> procedure2.ProcedureExecutor$WorkerMonitor(2147): Worker stuck 
> PEWorker-1(pid=11), run time 3mins, 22.122sec
> 2019-07-02 19:53:37,063 WARN  [RS:0;localhost:57336] util.Sleeper(86): We 
> slept 202150ms instead of 1000ms, this is likely due to a long garbage 
> collecting pause and it's usually bad, see 
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2019-07-02 19:53:37,064 INFO  [master/localhost:0.splitLogManager..Chore.1] 
> hbase.ScheduledChore(176): Chore: SplitLogManager Timeout Monitor missed its 
> start time
> Disconnected from the target VM, address: '127.0.0.1:57316', transport: 
> 'socket'
> 2019-07-02 19:53:37,063 INFO  [regionserver/localhost:0.Chore.1] 
> hbase.ScheduledChore(176): Chore: MemstoreFlusherChore missed its start time
> 2019-07-02 19:53:37,069 INFO  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2041): ASSERT pid=11
> java.lang.AssertionError
>       at 
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.add(TimeoutExecutorThread.java:81)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>       at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
> {code}
> So, the worker thread goes to execute the lock procedure. It sets the state 
> to WAITING_TIMEOUT... My break point trips and it holds up the timeout 
> executor. Meantime the main test thread continues and releases lock which 
> sets the procedure state back to RUNNABLE. I let the break point go and the 
> worker thread finds the procedure state is RUNNABLE and the assertion trips.
> I think the assert should tolerate RUNNABLE state too.
> Looking at all ways into the add to timeout executor, all test in caller in 
> one way or another for WAITING_TIMEOUT except for case where we are loading 
> procedures from stores. This latter is a little unpredictable. The other 
> entrances will have seen the state set to WAITING_TIMEOUT at least just 
> before we get to the assert point.
> Let me relax the assert some.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Resolved] (HBASE-22652) Flakey TestLockManager; test timed out after 780 seconds

Reply via email to