[ https://issues.apache.org/jira/browse/CURATOR-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488266#comment-17488266 ]
Shawn Weeks edited comment on CURATOR-627 at 2/7/22, 5:16 PM: -------------------------------------------------------------- I think this is a duplicate of CURATOR-561 or at least related. It appears that in 5.2.0 once suspended nothing ever actually forces it to reconnect. was (Author: absolutesantaja): I think this is a duplicate of CURATOR-561 > Curator remains in SUSPENDED state ignoring session timeout or network > recovery > ------------------------------------------------------------------------------- > > Key: CURATOR-627 > URL: https://issues.apache.org/jira/browse/CURATOR-627 > Project: Apache Curator > Issue Type: Bug > Components: Framework > Affects Versions: 4.2.0 > Environment: OS: Linux version 4.17.3-1.el7.elrepo.x86_64 > Reporter: Bai Yu > Priority: Major > Attachments: log analysis.md, log4j.log, pom.xml > > > h1. Description > Our program encountered a problem where curator got stuck in the SUSPENDED > state. It kept injecting "session expire" event, generating logs like this > every 15s (same as session timeout): > {quote}Session timeout has elapsed while SUSPENDED. Injecting a session > expiration{quote} > However, curator never transited to the LOST state after session timeout, nor > transied to the RECONNECTED state upon network recovery. > > According to logs containing "zookeeper" or "curator" (see attachment > "log4j.log"), related events are as follows: > (All events happened in date 2021.12.05) > * 23:34:07,122: ZooKeeper not heard from server, thus curator transited to > the SUSPENDED state. > * 23:34:09,128: ZooKeeper opening socket to zk server. > * 23:34:22,223: curator injected a "session expiration" event, but had not > transited to the LOST state at that time; never transited in the future. > * 23:34:22,885: ZooKeeper opened socket to zk server, and received reponse > indicating session expiration; thus, the current ZooKeeper object was closed. > However, another ZooKeeper object was not created as expected. > * 23:34:37,224 and later: curator kept injecting "session expiration" events > every session timeout (15s). The log was filled with lines like this: > "Session timeout has elapsed while SUSPENDED. Injecting a session expiration." > In summary, curator stayed in the SUSPENED state and never transited to the > LOST or RECONNECTED state. Besides, the underlying ZooKeeper object was never > recreated according to logs and the "jstack" command. For a more detailed > analysis, please refer to attachment "log analysis.md". > > The problem above was encountered only once in our testing environment within > months, and has never occurred in our production enviromnent. We failed to > find out how to reproduce, but suspect there is a racing condition when these > events happen simultaneously: while curator is injecting a "session expire" > event, the underlying "ClientCnxnSocketNIO" has just reconnected to a zk > server and got a "session expire" response. > h1. Environment > OS: Linux version 4.17.3-1.el7.elrepo.x86_64 > > Project type: maven > Curator version: 4.2.0 > > Session timeout: 15s, which is equal to that negotiated with the server. > Retry policy: no retry, where RetryPolicy#allowRetry() always returns false. > That's because we handle retry at application level. Write operations should > NOT be retried immediately after reconnected, until some extra validations > pass. > > Our program creates exactly one "CuratorFramework" instance connecting to the > only zk server in the testing environment. The server's connection string is > "6x.xx.xx.27:2181", with some digits consealed for security. Ip addresses in > logs are also such processed. > > ZkClient is imported for forward compatiblility with some legacy codes (still > dependent on ZkClient), but it is bridged to curator instead of connecting to > zk server itself. For more details about dependencies, please refer to file > "pom.xml" in attachments. > h1. Attachments > log4j.log: log of our program, with ip addresses and zk paths consealed. > log analysis.md: a more detailed analysis about the logs. > pom.xml: part of file "pom.xml", showing zookeeper-related dependencies. -- This message was sent by Atlassian Jira (v8.20.1#820001)