[ https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082931#comment-15082931 ]
Flavio Junqueira commented on ZOOKEEPER-2347:
---------------------------------------------

[~rakeshr] thanks for the update. It looks much better. I have tested it, and the new test case does hang without the other changes, but there are a few small points I want to raise:

# Do we really need a timeout of 90s? I'd rather have something like 30s or less.
# There is a typo in {{LOG.error("Exception while waiting to proess req", e);}} ({{proess}} should be {{process}}).
# Please add a description of the dependency cycle that we are testing for. For example, in step 7, you could say that we are testing that SyncRequestProcessor#shutdown holds a lock and waits on FinalRequestProcessor to complete a pending operation, which in turn also needs the ZooKeeperServer lock. This is to emphasize where the problem was and make it very clear.
# Replace {{"Waiting for FinalReqProcessor to be called"}} with {{"Waiting for FinalRequestProcessor to start processing request"}}, and {{"Waiting for SyncReqProcessor#shutdown to be called"}} with {{"Waiting for SyncRequestProcessor to shut down"}}.
# There are a couple of exceptions that we catch but do nothing about, because we rely on the timeout. It is better to fail the test case directly if a caught exception means failure, rather than rely on a timeout. If you don't like the idea of calling {{fail()}} from an auxiliary class, then we need to at least propagate the exception so that we can catch it and fail rather than wait; see the sketch below.

I would also feel more comfortable if we got another review here. I'm fairly confident, but given that we've missed this issue before, I'd rather have another +1 before we check in.
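To illustrate the propagation idea in point 5, here is a minimal sketch assuming a hypothetical auxiliary worker class; the names {{RequestWorker}} and {{getError}} are made up for illustration and are not from the patch. The worker records any exception it hits instead of only logging it, so the test thread can fail fast rather than wait out the timeout:

{code}
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical auxiliary class: record the failure instead of swallowing it,
// so the test can fail immediately rather than waiting for its timeout.
class RequestWorker implements Runnable {
    private final AtomicReference<Throwable> error = new AtomicReference<>();

    @Override
    public void run() {
        try {
            // ... drive the request through the processor pipeline ...
        } catch (Exception e) {
            // Keep the first failure for the test thread instead of only logging it.
            error.compareAndSet(null, e);
        }
    }

    Throwable getError() {
        return error.get();
    }
}
{code}

The test thread can then assert {{assertNull("auxiliary worker failed", worker.getError());}} once the latch fires (or rethrow the recorded throwable), so a real failure surfaces immediately instead of as a timeout.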
> Deadlock shutting down zookeeper
> --------------------------------
>
>                 Key: ZOOKEEPER-2347
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.7
>            Reporter: Ted Yu
>            Assignee: Rakesh R
>            Priority: Blocker
>             Fix For: 3.4.8
>
>         Attachments: ZOOKEEPER-2347-br-3.4.patch, ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to zookeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the end of the test.
> Below is a snippet of the stack trace related to zookeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x00007fd27488a800 nid=0x6f1f waiting on condition [0x000000011834b000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00000007c5b8d3a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>     at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
>
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x00007fd274eb4000 nid=0x9513 waiting on condition [0x0000000118042000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>     at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
>
> "SyncThread:0" prio=5 tid=0x00007fd274d02000 nid=0x730f waiting for monitor entry [0x00000001170ac000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>     - waiting to lock <0x00000007c5b62128> (a org.apache.zookeeper.server.ZooKeeperServer)
>     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
>
> "main-EventThread" daemon prio=5 tid=0x00007fd2753a3800 nid=0x711b waiting on condition [0x0000000117a30000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00000007c9b106b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>     at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
>
> "main" prio=5 tid=0x00007fd276000000 nid=0x1903 in Object.wait() [0x0000000108aa1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x00000007c5b66400> (a org.apache.zookeeper.server.SyncRequestProcessor)
>     at java.lang.Thread.join(Thread.java:1281)
>     - locked <0x00000007c5b66400> (a org.apache.zookeeper.server.SyncRequestProcessor)
>     at java.lang.Thread.join(Thread.java:1355)
>     at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>     at org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>     at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>     - locked <0x00000007c5b62128> (a org.apache.zookeeper.server.ZooKeeperServer)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>     at org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x00000007c5b66400) in the last hunk, which seems to indicate some form of deadlock: the main thread has locked the ZooKeeperServer (0x00000007c5b62128) inside the synchronized shutdown() and is joining the SyncRequestProcessor thread (0x00000007c5b66400), while SyncThread:0 is blocked in decInProcess() waiting for that same ZooKeeperServer lock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is also synchronized and called from a different thread. So yeah, deadlock.
> This came in with ZOOKEEPER-1907.
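For reference, below is a minimal, self-contained sketch of the cycle Camille describes. The names mirror {{ZooKeeperServer#shutdown}}, {{ZooKeeperServer#decInProcess}}, and {{SyncThread}}, but this is an illustration of the locking pattern only, not the actual ZooKeeper code: a synchronized {{shutdown()}} joins a worker thread that cannot exit before entering the equally synchronized {{decInProcess()}} on the same object.

{code}
// Illustration only: a synchronized shutdown() that joins a thread which
// needs the same object monitor to finish. Running this hangs the way the
// trace above does.
public class ShutdownDeadlockSketch {
    private int inProcess = 1;
    private Thread syncThread;

    // "main" takes the object monitor for the whole shutdown...
    public synchronized void shutdown() throws InterruptedException {
        syncThread.join(); // ...and waits for the sync thread to exit.
    }

    // ...but the sync thread cannot exit before acquiring the same monitor here.
    public synchronized void decInProcess() {
        inProcess--;
    }

    public static void main(String[] args) throws InterruptedException {
        ShutdownDeadlockSketch server = new ShutdownDeadlockSketch();
        server.syncThread = new Thread(() -> {
            try {
                Thread.sleep(100); // best effort: let main enter shutdown() first
            } catch (InterruptedException ignored) {
            }
            server.decInProcess(); // BLOCKED: main holds the monitor in shutdown()
        }, "SyncThread:0");
        server.syncThread.start();
        server.shutdown(); // WAITING in Thread.join(): the deadlock
    }
}
{code}

When it hangs, a thread dump shows the same picture as the reported trace: {{main}} WAITING in {{Thread.join()}} while holding the server monitor, and {{SyncThread:0}} BLOCKED on that monitor in {{decInProcess()}}.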