[ https://issues.apache.org/jira/browse/ZOOKEEPER-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902543#comment-17902543 ]
mutu commented on ZOOKEEPER-4817:
---------------------------------

I tried to trace the NIOServerCnxn thread. I found that the thread ID of NIOServerCnxn on node2 changes when the stuck time exceeds 20s. Hence, I guess that the NIOServerCnxn thread may be killed by the JVM for some reason. Additionally, the system logs show that the server is still alive. Do you have any suggestions for figuring out this problem? Thanks.

> Client disconnection warning is missed in system log sometimes.
> ---------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4817
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4817
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Major
>         Attachments: system1_20s.log, system1_25s.log, system2_20s.log,
> system2_25s.log, system3_20s.log, system3_25s.log
>
>
> Recently, we encountered a confusing issue: the client disconnection warning
> is sometimes missing from the system log, although at other times it appears.
> The cluster consists of three nodes. A client sends many creation requests
> and then reads the node created by the first request. The client read
> operation fails. When we check the system log, sometimes there is a client
> disconnection warning and sometimes there is not. This incomplete system log
> misleads the client's diagnosis of the problem.
> After investigating, we found that when NIOServerCnxn.doIO is stuck at any
> I/O point in this function and the stuck time exceeds 20s, the client
> disconnection warning disappears. If the stuck time is less than 20s, the
> client disconnection warning appears in the system log.
> We found that the root cause is that selectorThread is set as a daemon
> thread. When one node encounters a fail-slow NIC, the client disconnects
> from the node. If NIOServerCnxn.doIO is stuck and the stuck time exceeds
> 20s, the corresponding selectorThread is killed by the JVM. Hence, the
> client disconnection warning is missed.
> The attached 20s logs contain a CancelledKeyException, but the 25s logs do
> not.
> Are there any suggestions for tracking down this issue and improving the
> diagnosability of ZooKeeper? I would appreciate them very much.
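
One way to test the guess that a blocked daemon thread is killed by the JVM is a small standalone probe outside ZooKeeper. The sketch below is only an illustration under stated assumptions: it is not ZooKeeper code, the class name DaemonThreadProbe and the thread name "probe-selector" are hypothetical, and Thread.sleep stands in for a stuck I/O call. It checks two things: whether a daemon thread that blocks for a given time still finishes its work while the JVM keeps running, and whether an uncaught exception handler can record the name and ID of a thread that dies. In standard Java, the daemon flag only controls whether a thread keeps the JVM from exiting, so a probe like this may help separate that effect from a thread that terminated on an exception and was replaced by a new one.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical standalone probe; none of these names come from ZooKeeper.
class DaemonThreadProbe {

    // Returns true if a daemon thread that blocks for blockMillis still
    // completes its work while the JVM keeps running.
    static boolean daemonSurvivesBlocking(long blockMillis) {
        CountDownLatch finished = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(blockMillis); // stand-in for a stuck I/O call
                finished.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "probe-selector");
        t.setDaemon(true);
        t.start();
        try {
            // Allow some slack beyond the blocking time before giving up.
            return finished.await(blockMillis + 5_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Starts a daemon thread that fails with an uncaught exception and
    // returns what the per-thread handler recorded about its death.
    static String captureUncaughtFailure() {
        AtomicReference<String> report = new AtomicReference<>("");
        Thread t = new Thread(() -> {
            throw new IllegalStateException("simulated failure");
        }, "probe-selector");
        t.setDaemon(true);
        t.setUncaughtExceptionHandler((thread, error) ->
            report.set(thread.getName() + " (id " + thread.getId() + ") died: " + error));
        t.start();
        try {
            t.join(); // the handler runs on the dying thread before it terminates
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return report.get();
    }

    public static void main(String[] args) {
        // 25 s mirrors the longer stuck time in the attached logs.
        System.out.println("survived blocking: " + daemonSurvivesBlocking(25_000));
        System.out.println(captureUncaughtFailure());
    }
}
```

If the first check reports true for a 25s block, the missing warning is probably not explained by the daemon flag alone, and a handler like the one in the second check, or a thread dump taken while doIO is stuck, could show why the thread ID changed.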