[
https://issues.apache.org/jira/browse/CURATOR-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927054#comment-16927054
]
Jacob Schlather commented on CURATOR-525:
-----------------------------------------
I will try to reproduce this, but I only see it very infrequently; here are my
application logs. I see a session-loss event, then a reconnect, and then,
shortly after reconnecting, a background operation sees the session-lost event
again, after which Curator is totally broken until the session happens to time
out. This seems to resemble the scenario described above, but with a lost
session instead of a suspended connection.
2019-09-06 05:02:07.705 [EventThread] WARN org.apache.curator.ConnectionState -
Session expired event received
2019-09-06 05:02:07.705 [SendThread(zk-node-2.k8s.run:2181)] INFO
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service,
session 0x101392be97d0195 has expired, closing socket connection
2019-09-06 05:02:07.707 [EventThread] INFO o.a.c.f.state.ConnectionStateManager
- State change: LOST
2019-09-06 05:02:07.707 [EventThread] INFO org.apache.zookeeper.ClientCnxn -
EventThread shut down for session: 0x101392be97d0195
2019-09-06 05:02:07.708 [SendThread(zk-node-2.k8s.run:2181)] INFO
org.apache.zookeeper.ClientCnxn - Opening socket connection to server
zk-node-2.k8s.run/172.19.20.92:2181. Will not attempt to authenticate using
SASL (unknown error)
2019-09-06 05:02:07.709 [SendThread(zk-node-2.k8s.run:2181)] INFO
org.apache.zookeeper.ClientCnxn - Socket connection established to
zk-node-2.k8s.run/172.19.20.92:2181, initiating session
2019-09-06 05:02:07.711 [SendThread(zk-node-2.k8s.run:2181)] INFO
org.apache.zookeeper.ClientCnxn - Session establishment complete on server
zk-node-2.k8s.run/172.19.20.92:2181, sessionid = 0x301a029da8901da, negotiated
timeout = 2000
2019-09-06 05:02:07.711 [EventThread] INFO o.a.c.f.state.ConnectionStateManager
- State change: RECONNECTED
2019-09-06 05:02:07.806 [SendThread(zk-node-2.k8s.run:2181)] INFO
o.a.c.f.state.ConnectionStateManager - State change: LOST
2019-09-06 05:02:07.806 [SendThread(zk-node-2.k8s.run:2181)] ERROR
o.a.c.f.imps.CuratorFrameworkImpl - Background operation retry gave up
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired
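For what it's worth, here is a minimal sketch of how application code can
observe these transitions (the connect string is illustrative, not my real
setup):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    // Sketch: print every connection-state transition Curator reports, so the
    // LOST -> RECONNECTED -> LOST sequence above is visible to the application.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk-node-2.k8s.run:2181", new ExponentialBackoffRetry(1000, 3));
    client.getConnectionStateListenable().addListener(
            (CuratorFramework c, ConnectionState newState) ->
                    System.out.println("Connection state: " + newState));
    client.start();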
> There is a race condition in Curator which might lead to fake SUSPENDED event
> and ruin CuratorFrameworkImpl inner state
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: CURATOR-525
> URL: https://issues.apache.org/jira/browse/CURATOR-525
> Project: Apache Curator
> Issue Type: Bug
> Components: Framework
> Affects Versions: 4.2.0
> Reporter: Mikhail Valiev
> Assignee: Cameron McKenzie
> Priority: Critical
> Attachments: CuratorFrameworkTest.java,
> background-thread-infinite-loop.png, curator-race-condition.png,
> event-watcher-thread.png
>
>
> This was originally found in the 2.11.1 version of Curator, but I tested the
> latest release as well, and the issue is still there.
> The issue is tied to guaranteed deletes and how they loop infinitely if
> called when there is no connection:
> client.delete().guaranteed().forPath(ourPath);
> [https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html]
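>
> A self-contained sketch of such a call (the connect string, retry policy and
> path are illustrative):
>
> CuratorFramework client = CuratorFrameworkFactory.newClient(
>         "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
> client.start();
> client.blockUntilConnected();
> // guaranteed(): if the delete fails with a connection loss, Curator keeps
> // retrying it in the background until the node is confirmed deleted
> client.delete().guaranteed().forPath("/our/path");
>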
> This schedules a background operation which attempts to remove the node in an
> infinite loop. Each time the background operation fails due to connection
> loss, it performs a check (the validateConnection() function) to see whether
> the main thread is already aware of the connection loss, and if it is not,
> raises the connection-loss event. The problem is that this piece of code is
> also executed by the event watcher thread when connection events happen,
> which leads to a race condition: when the connection is restored, it is
> easily possible for the main thread to raise the RECONNECTED event and, after
> that, for the background thread to raise a SUSPENDED event.
> So we might get unlucky and receive a "phantom" SUSPENDED event, which breaks
> Curator's inner connection state and leads to Curator behaving unpredictably.
> Attached are some illustrations and a unit test that reproduces the issue
> (put a debug point in validateConnection()).
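>
> For readers without the attachment, a rough outline of the reproduction,
> assuming Curator's TestingServer from the curator-test module (the attached
> CuratorFrameworkTest.java is the authoritative version):
>
> try (TestingServer server = new TestingServer()) {
>     CuratorFramework client = CuratorFrameworkFactory.newClient(
>             server.getConnectString(), new RetryOneTime(100));
>     client.start();
>     client.create().forPath("/node");
>     server.stop();                                 // drop the connection
>     try {
>         client.delete().guaranteed().forPath("/node");
>     } catch (Exception expected) {
>         // connection loss is expected here; the delete is now being
>         // retried by the background thread
>     }
>     server.restart();                              // event thread raises RECONNECTED
>     // with a debug point in validateConnection(), the background thread can
>     // now be made to raise a stale SUSPENDED after the RECONNECTED above
> }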
> *Possible solution*: in the CuratorFrameworkImpl class, adjust the
> processEvent() function by adding the following:
> if (event.getType() == CuratorEventType.SYNC) {
>     connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
> }
> If this is the same state as before, it will be ignored; if the background
> operation succeeded but we are still in the SUSPENDED state, this repairs the
> Curator state and raises a RECONNECTED event.
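>
> Roughly, in context (a sketch against processEvent() in CuratorFrameworkImpl;
> the surrounding code is paraphrased and may differ between versions):
>
> private void processEvent(final CuratorEvent curatorEvent) {
>     if (curatorEvent.getType() == CuratorEventType.WATCHED) {
>         validateConnection(curatorEvent.getWatchedEvent().getState());
>     }
>     // proposed addition: a completed background SYNC proves the session is
>     // alive, so push RECONNECTED; ConnectionStateManager.addStateChange()
>     // ignores a state equal to the current one
>     if (curatorEvent.getType() == CuratorEventType.SYNC) {
>         connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
>     }
>     // ... existing listener dispatch continues here ...
> }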
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)