Mikhail Valiev created CURATOR-525:
--------------------------------------

             Summary: There is a race condition in Curator which might lead to 
fake SUSPENDED event and ruin CuratorFrameworkImpl inner state 
                 Key: CURATOR-525
                 URL: https://issues.apache.org/jira/browse/CURATOR-525
             Project: Apache Curator
          Issue Type: Bug
          Components: Framework
    Affects Versions: 4.2.0
            Reporter: Mikhail Valiev
         Attachments: CuratorFrameworkTest.java, 
background-thread-infinite-loop.png, curator-race-condition.png, 
event-watcher-thread.png

This was originally found in the 2.11.1 version of Curator, but I tested the 
latest release as well, and the issue is still there.

The issue is tied to guaranteed deletes and how it loops infinitely, if called 
when there is no connection:

client.delete().guaranteed().forPath(ourPath); 

[https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html]

This schedules a background operation which attempts to remove the node in 
infinite loop. Each time a background operation fails due to connection loss it 
performs a check (validateConnection() function) to see if the main thread is 
already aware of connection loss, and if it's not - raises the connection loss 
event. The problem is that this peace of code is also executed by the event 
watcher thread when connection events are happening - this leads to race 
condition. So when connection is restored it's easily possible for the main 
thread to raise RECONNECTED event and after that for background thread to raise 
SUSPENDED event.

We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator 
inner Connection state and leads to curator behaving unpredictably

Attached some illustrations and Unit test to reproduce the issue. (Put debug 
point in validateConnection() )

*Possible solution*: in CuratorFrameworkImpl class adjust the processEvent() 
function and add the following:

if(event.getType() == CuratorEventType.SYNC) {

connectionStateManager.addStateChange(ConnectionState.RECONNECTED);

}

If this is a same state as before - it will be ignored, if background operation 
succeeded, but we are in SUSPENDED state - this would repair the Curator state 
and raise RECONNECTED event.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to