[ 
https://issues.apache.org/jira/browse/CURATOR-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhanglu153 updated CURATOR-722:
-------------------------------
    Attachment: ConnectionState.patch

> Zookeeper connection leak after session expiration
> --------------------------------------------------
>
>                 Key: CURATOR-722
>                 URL: https://issues.apache.org/jira/browse/CURATOR-722
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.7.1, 2.12.0, 2.13.0
>            Reporter: zhanglu153
>            Priority: Major
>         Attachments: ConnectionState.patch, testCuratorClient.java
>
>
> *User testing code description:*
> The test code is in [^testCuratorClient.java].
> When creating the Curator client, the user adds a CuratorListener in the 
> test code to listen for the AuthFailed event generated when the client's SASL 
> authentication fails.
> When the listener detects the AuthFailed event, it will close the Curator 
> client, delete the created node, and enter a while loop.
> In the while loop, it rebuilds a Curator client, adds the CuratorListener, 
> starts the Curator client, and creates a znode with SASL permission under 
> the /test node, which also has SASL permission. Once this succeeds, the 
> listener code exits.
> While Kerberos remains unavailable, this loop keeps running and blocks the 
> handling of further AuthFailed events.
> The listener takes a lock so that an AuthFailed event raised after the newly 
> built Curator client starts cannot enter the listener again, which avoids 
> unnecessary while loops.
> *Scenario of Connection Leakage Issue:*
>  * The Zookeeper client successfully connected to the zookeeper server with 
> session ID 0x0 and created a znode with SASL permission.
>  * Cause an exception that temporarily disconnects the session from the 
> server.
>  * The session state in Curator changes to suspended, and the client 
> prepares to reconnect to the server. Set a debugger breakpoint in the 
> org.apache.zookeeper.ClientCnxn.SendThread#startConnect method, stop the 
> Kerberos service, and wait for the server to determine that the session 
> has expired.
>  * Resume execution after the Kerberos service has stopped and the 
> session has expired. The client connects to the ZooKeeper server 
> without SASL authentication and sends an AuthFailed event.
>  * The listener set by the user receives the AuthFailed event and 
> starts executing its handling logic.
>  * Before closing the Curator client in the listener, the session is found to 
> have expired, and the client sends the Expired event and eventOfDeath.
>  * At the same time, in the Curator framework, the 
> org.apache.curator.ConnectionState#checkTimeouts method detects a connection 
> timeout and calls the reset method to close the old session 0x0. Since 
> session 0x0 has expired and the connection status has been set to CLOSED, 
> this.cnxn.getState().isAlive() returns false when close() is called to 
> release resources. Session 0x0 is therefore treated as already closed, and a 
> new Zookeeper object is created to establish a new session 0x1.
>  * Restore the kerberos service.
>  * The listener set by the user will close the session 0x1 of the Curator 
> client, rebuild a new Curator client, start a new session 0x2, and 
> successfully create a znode with SASL permission.
>  * The SendThread of session 0x0 was closed after the session expired, but 
> its EventThread has not finished because the eventOfDeath object 
> has not been processed yet. The old org.apache.curator.ConnectionState object 
> in Curator is still referenced by the pending Expired event. When Curator 
> receives the Expired event, it calls the reset method again, causing the 
> Curator object that has already called closeAndClear to establish a new 
> session 0x3.
> At this point, both the leaked session 0x3 and the session 0x2 that the user 
> rebuilt are connected to the server simultaneously.
> This scenario produces a ZooKeeper connection leak when using Curator. 
> After the user has called the close method to close and clean up resources, 
> the Curator framework should not call the reset method again on receiving an 
> Expired event, because doing so restarts a leaked connection. 
> A user's call to the close method should take priority over the 
> framework's handling of Expired events. I also found a similar issue 
> for Curator 4.x: CURATOR-437, "zookeeper connection leak when session 
> expires".
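
The rule proposed above (a user's close should win over a later Expired event) can be illustrated with a minimal sketch. This is not Curator's code and not the contents of ConnectionState.patch: GuardedConnectionState, its integer session counter, and the NO_SESSION sentinel are hypothetical stand-ins used only to show the ordering guarantee.

```java
// Hypothetical, simplified model of a connection-state holder.
// It does not use or reproduce org.apache.curator.ConnectionState.
class GuardedConnectionState {
    static final int NO_SESSION = -1;   // sentinel: no live session handle

    private boolean closed = false;     // set once the user calls close()
    private int nextSessionId = 0;      // stand-in for server-assigned session IDs
    private int activeSession = NO_SESSION;

    // Opens the first session.
    synchronized void start() {
        if (closed) {
            throw new IllegalStateException("already closed");
        }
        openNewSession();
    }

    // Models the framework reacting to a timeout or an Expired event.
    // Returns true only if a new session was actually opened.
    synchronized boolean reset() {
        if (closed) {
            return false;               // close() already ran: do not reconnect
        }
        openNewSession();
        return true;
    }

    // Models the user closing the client; later reset() calls become no-ops.
    synchronized void close() {
        closed = true;
        activeSession = NO_SESSION;
    }

    synchronized int activeSession() {
        return activeSession;
    }

    private void openNewSession() {
        activeSession = nextSessionId++;
    }
}
```

Because close() and reset() synchronize on the same object, an event thread that delivers a late Expired event either runs reset() before close() (and the session it opens is then cleared by close()) or runs it afterwards and opens nothing. In the real client the old handle would also have to be closed inside the same critical section; that part is left out of this sketch.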



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
