kezhuw commented on PR #1917: URL: https://github.com/apache/zookeeper/pull/1917#issuecomment-2345528254
The jira ZOOKEEPER-4293 has the deadlock stack dump. Below is what https://github.com/kezhuw/zookeeper/commit/0f2610b2f4116a4cd4e1b7c3ce86f4ab14aee029 reproduced: 1. Send thread which is holding the lock blocks on channel closing future. 2. Cancellation of `connectFuture` from previous run blocks on acquiring lock. This occupies the sole event loop thread. 3. Channel closing is waits for event loop thread. ```mermaid graph TD; SendThread -- blocks on --- ChannelClosing[channel closing]; ChannelClosing[channel closing] -- waits for event loop --- ConnectFutureCancellation[old connection cancellation]; ConnectFutureCancellation[old connection cancellation] -- blocks on lock --- SendThread; ``` I think the bug is multi-folds: 1. Netty event group uses only one thread. 2. Channel is closed synchronous under lock. 3. Connecting future is uncancellable. So `connectFuture.cancel(false)` is a nop for connecting future. Theoretically, we can fix the deadlock by breaking any of the above. netty/netty#13849 breaks the third. I do think bump netty version cloud fix this deadlock. But I also think it might be fragile to depend on bump dependency version to fix bug as netty version could be pinned/upgraded independently. So, personally I prefer to fix it in our side. cc @anmolnar -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org