[ 
https://issues.apache.org/jira/browse/CURATOR-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488266#comment-17488266
 ] 

Shawn Weeks edited comment on CURATOR-627 at 2/7/22, 5:16 PM:
--------------------------------------------------------------

I think this is a duplicate of CURATOR-561, or at least related to it. It 
appears that in 5.2.0, once the client is suspended, nothing ever actually 
forces it to reconnect.


was (Author: absolutesantaja):
I think this is a duplicate of CURATOR-561

> Curator remains in SUSPENDED state ignoring session timeout or network 
> recovery
> -------------------------------------------------------------------------------
>
>                 Key: CURATOR-627
>                 URL: https://issues.apache.org/jira/browse/CURATOR-627
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.2.0
>         Environment: OS: Linux version 4.17.3-1.el7.elrepo.x86_64
>            Reporter: Bai Yu
>            Priority: Major
>         Attachments: log analysis.md, log4j.log, pom.xml
>
>
> h1. Description
> Our program encountered a problem where curator got stuck in the SUSPENDED 
> state. It kept injecting "session expire" events, generating logs like this 
> every 15s (the same as the session timeout):
> {quote}Session timeout has elapsed while SUSPENDED. Injecting a session 
> expiration{quote}
> However, curator never transitioned to the LOST state after the session 
> timeout, nor to the RECONNECTED state upon network recovery. 
>  
> According to logs containing "zookeeper" or "curator" (see attachment 
> "log4j.log"), related events are as follows:
> (All events happened on 2021.12.05.)
>  * 23:34:07,122: ZooKeeper had not heard from the server, so curator 
> transitioned to the SUSPENDED state.
>  * 23:34:09,128: ZooKeeper started opening a socket to the zk server.
>  * 23:34:22,223: curator injected a "session expiration" event, but did not 
> transition to the LOST state at that time, nor at any later time.
>  * 23:34:22,885: ZooKeeper opened a socket to the zk server and received a 
> response indicating session expiration; thus, the current ZooKeeper object 
> was closed. However, a new ZooKeeper object was not created as expected.
>  * 23:34:37,224 and later: curator kept injecting "session expiration" events 
> every session timeout (15s). The log was filled with lines like this: 
> "Session timeout has elapsed while SUSPENDED. Injecting a session expiration."
> In summary, curator stayed in the SUSPENDED state and never transitioned to 
> the LOST or RECONNECTED state. Moreover, the underlying ZooKeeper object was 
> never recreated, according to the logs and the "jstack" command. For a more 
> detailed analysis, please refer to attachment "log analysis.md".
>  
> The problem above was encountered only once in our testing environment within 
> months, and has never occurred in our production environment. We have not 
> found a way to reproduce it, but suspect there is a race condition when these 
> events happen simultaneously: while curator is injecting a "session expire" 
> event, the underlying "ClientCnxnSocketNIO" has just reconnected to a zk 
> server and received a "session expire" response.
> h1. Environment
> OS: Linux version 4.17.3-1.el7.elrepo.x86_64
>  
> Project type: maven
> Curator version: 4.2.0
>  
> Session timeout: 15s, which is equal to that negotiated with the server.
> Retry policy: no retry, where RetryPolicy#allowRetry() always returns false. 
> That's because we handle retries at the application level. Write operations 
> should NOT be retried immediately after reconnecting, until some extra 
> validations pass.
>  
> Our program creates exactly one "CuratorFramework" instance connecting to the 
> only zk server in the testing environment. The server's connection string is 
> "6x.xx.xx.27:2181", with some digits concealed for security. IP addresses in 
> the logs are masked in the same way. 
>  
> ZkClient is imported for forward compatibility with some legacy code (still 
> dependent on ZkClient), but it is bridged to curator instead of connecting to 
> the zk server itself. For more details about dependencies, please refer to 
> file "pom.xml" in attachments.
> h1. Attachments
> log4j.log: log of our program, with IP addresses and zk paths concealed.
> log analysis.md: a more detailed analysis about the logs.
> pom.xml: part of file "pom.xml", showing zookeeper-related dependencies.
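
For anyone hitting the same symptom, the "stuck in SUSPENDED" condition described above can be detected at the application level. The following is only a minimal sketch under stated assumptions, not Curator's API and not a fix for the underlying bug: the class SuspendedWatchdog, its State enum, and the "multiples of the session timeout" threshold are all hypothetical names chosen for illustration. The watchdog records when SUSPENDED was entered and reports "stuck" once that state has lasted longer than a chosen number of session timeouts.

```java
import java.time.Duration;

/** Hypothetical application-level watchdog; NOT part of Curator's API. */
public class SuspendedWatchdog {
    /** Local mirror of the connection states named in the report (assumption). */
    public enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final long limitMs;          // how long SUSPENDED is tolerated
    private State state = State.CONNECTED;
    private long suspendedSinceMs = -1;  // -1 means "not currently suspended"

    /** limit = sessionTimeout * multiples; both values are chosen by the application. */
    public SuspendedWatchdog(Duration sessionTimeout, int multiples) {
        this.limitMs = sessionTimeout.toMillis() * multiples;
    }

    /** Record a state change observed at time nowMs (milliseconds). */
    public synchronized void onStateChange(State newState, long nowMs) {
        if (newState == State.SUSPENDED) {
            // Keep the original timestamp if SUSPENDED is reported repeatedly.
            if (state != State.SUSPENDED) {
                suspendedSinceMs = nowMs;
            }
        } else {
            suspendedSinceMs = -1;
        }
        state = newState;
    }

    /** True if the state has been SUSPENDED for longer than the limit. */
    public synchronized boolean isStuck(long nowMs) {
        return state == State.SUSPENDED && nowMs - suspendedSinceMs > limitMs;
    }

    public static void main(String[] args) {
        // 15s session timeout as in the report; tolerate two timeouts (assumption).
        SuspendedWatchdog w = new SuspendedWatchdog(Duration.ofSeconds(15), 2);
        w.onStateChange(State.SUSPENDED, 0L);
        System.out.println(w.isStuck(15_000L)); // within the limit
        System.out.println(w.isStuck(30_001L)); // past the limit
    }
}
```

In a real program, onStateChange would presumably be driven from a Curator ConnectionStateListener, and isStuck polled from a scheduled task. What to do once it returns true (for example, closing and recreating the CuratorFramework instance) is an application decision that this sketch deliberately leaves open.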



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
