[jira] [Created] (HBASE-22652) Flakey TestLockManager; test timed out after 780 seconds

stack (JIRA) Tue, 02 Jul 2019 20:13:24 -0700

stack created HBASE-22652:
-----------------------------

             Summary: Flakey TestLockManager; test timed out after 780 seconds
                 Key: HBASE-22652
                 URL: https://issues.apache.org/jira/browse/HBASE-22652
             Project: HBase
          Issue Type: Bug
          Components: proc-v2
            Reporter: stack
            Assignee: stack
             Fix For: 2.1.6



In nightly 1318 on branch-2.1, TestLockManager failed. It was only fail. It 
fails on occasion. In this instance, test failed like this:

{code}
Error Message
test timed out after 780 seconds
Stacktrace
org.junit.runners.model.TestTimedOutException: test timed out after 780 seconds
        at 
org.apache.hadoop.hbase.master.locking.TestLockManager.tearDown(TestLockManager.java:105)
{code}

Digging in log, before the timeout in teardown, you see this:
{code}
2019-07-01 22:45:57,849 INFO  [PEWorker-1] 
procedure2.ProcedureExecutor$WorkerThread(2041): ASSERT pid=11
java.lang.AssertionError
        at 
org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.add(TimeoutExecutorThread.java:81)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
{code}

i.e. an assert trips. The assert is doing this:

{code}
    assert procedure.getState() == ProcedureState.WAITING_TIMEOUT;
{code}

The procedure is expected to be in WAITING_TIMEOUT state when we go to add it 
to the timeout executor. In the log snippet above, we see the thread is in 
RUNNABLE state. But even on successful run it is in RUNNABLE state. So what is 
going on.

I think I figured it accidentally.

I'd added a pause at the assert and when the assert ran, procedure was in 
RUNNABLE state and it would fail every time. So, it is a timing issue.

I added logging of the setting of state on the procedure. There weren't many 
procedures so was easy to follow. Here is what I saw (log included emission of 
stack trace):

{code}
2019-07-02 19:50:14,941 DEBUG [PEWorker-1] locking.LockProcedure(313): LOCKED 
pid=11, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, 
namespace=namespace, type=EXCLUSIVE
2019-07-02 19:50:14,943 INFO  [PEWorker-1] procedure2.Procedure(793): SET STATE 
ON pid=11, state=WAITING_TIMEOUT, locked=true; 
org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, 
type=EXCLUSIVE
java.lang.Exception
        at 
org.apache.hadoop.hbase.procedure2.Procedure.setState(Procedure.java:793)
        at 
org.apache.hadoop.hbase.master.locking.LockProcedure.execute(LockProcedure.java:249)
        at 
org.apache.hadoop.hbase.master.locking.LockProcedure.execute(LockProcedure.java:56)
        at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:966)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1723)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
2019-07-02 19:50:14,944 INFO  [Time-limited test] procedure2.Procedure(793): 
SET STATE ON pid=11, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, 
type=EXCLUSIVE
java.lang.Exception
        at 
org.apache.hadoop.hbase.procedure2.Procedure.setState(Procedure.java:793)
        at 
org.apache.hadoop.hbase.master.locking.LockProcedure.unlock(LockProcedure.java:225)
        at 
org.apache.hadoop.hbase.master.locking.LockManager$MasterLock.release(LockManager.java:185)
        at 
org.apache.hadoop.hbase.master.locking.TestLockManager.testMasterLockAcquire(TestLockManager.java:124)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
        at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
        at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
        at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
        at java.util.concurrent.FutureTask.run(FutureTask.java)
        at java.lang.Thread.run(Thread.java:748)
2019-07-02 19:50:15,037 DEBUG [Time-limited test] 
procedure.MasterProcedureScheduler(356): Add TableQueue(hbase:namespace, 
xlock=false sharedLock=1 size=1) to run queue because: pid=11, state=RUNNABLE, 
locked=true; org.apache.hadoop.hbase.master.locking.LockProcedure, 
namespace=namespace, type=EXCLUSIVE has lock
2019-07-02 19:53:37,062 WARN  [JvmPauseMonitor] 
util.JvmPauseMonitor$Monitor(176): Detected pause in JVM or host machine (eg 
GC): pause of approximately 201656ms
No GCs detected
2019-07-02 19:53:37,063 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(2147): Worker stuck 
PEWorker-1(pid=11), run time 3mins, 22.122sec
2019-07-02 19:53:37,063 WARN  [RS:0;localhost:57336] util.Sleeper(86): We slept 
202150ms instead of 1000ms, this is likely due to a long garbage collecting 
pause and it's usually bad, see 
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2019-07-02 19:53:37,064 INFO  [master/localhost:0.splitLogManager..Chore.1] 
hbase.ScheduledChore(176): Chore: SplitLogManager Timeout Monitor missed its 
start time
Disconnected from the target VM, address: '127.0.0.1:57316', transport: 'socket'
2019-07-02 19:53:37,063 INFO  [regionserver/localhost:0.Chore.1] 
hbase.ScheduledChore(176): Chore: MemstoreFlusherChore missed its start time
2019-07-02 19:53:37,069 INFO  [PEWorker-1] 
procedure2.ProcedureExecutor$WorkerThread(2041): ASSERT pid=11
java.lang.AssertionError
        at 
org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.add(TimeoutExecutorThread.java:81)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
{code}

So, the worker thread goes to execute the lock procedure. It sets the state to 
WAITING_TIMEOUT... My break point trips and it holds up the timeout executor. 
Meantime the main test thread continues and releases lock which sets the 
procedure state back to RUNNABLE. I let the break point go and the worker 
thread finds the procedure state is RUNNABLE and the assertion trips.

I think the assert should tolerate RUNNABLE state too.

Looking at all ways into the add to timeout executor, all test in caller in one 
way or another for WAITING_TIMEOUT except for case where we are loading 
procedures from stores. This latter is a little unpredictable. The other 
entrances will have seen the state set to WAITING_TIMEOUT at least just before 
we get to the assert point.

Let me relax the assert some.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (HBASE-22652) Flakey TestLockManager; test timed out after 780 seconds

Reply via email to