kezhuw commented on PR #1917:
URL: https://github.com/apache/zookeeper/pull/1917#issuecomment-2345528254

   The jira ZOOKEEPER-4293 has the deadlock stack dump. Below is what 
https://github.com/kezhuw/zookeeper/commit/0f2610b2f4116a4cd4e1b7c3ce86f4ab14aee029
 reproduced:
   1. Send thread which is holding the lock blocks on channel closing future.
   2. Cancellation of `connectFuture` from previous run blocks on acquiring 
lock. This occupies the sole event loop thread.
   3. Channel closing is waits for event loop thread.
   
   ```mermaid
   graph TD;
       SendThread -- blocks on --- ChannelClosing[channel closing];
       ChannelClosing[channel closing] -- waits for event loop --- 
ConnectFutureCancellation[old connection cancellation];
       ConnectFutureCancellation[old connection cancellation] -- blocks on lock 
--- SendThread;
   ```
   
   I think the bug is multi-folds:
   1. Netty event group uses only one thread.
   2. Channel is closed synchronous under lock.
   3. Connecting future is uncancellable. So `connectFuture.cancel(false)` is a 
nop for connecting future.
   
   Theoretically, we can fix the deadlock by breaking any of the above. 
netty/netty#13849 breaks the third. I do think bump netty version cloud fix 
this deadlock. But I also think it might be fragile to depend on bump 
dependency version to fix bug as netty version could be pinned/upgraded 
independently. So, personally I prefer to fix it in our side.
   
   cc @anmolnar 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to