Attila Doroszlai created RATIS-1960:
---------------------------------------

             Summary: Follower may be incorrectly marked as having caught up
                 Key: RATIS-1960
                 URL: https://issues.apache.org/jira/browse/RATIS-1960
             Project: Ratis
          Issue Type: Bug
          Components: server
            Reporter: Attila Doroszlai
            Assignee: Attila Doroszlai


I think there is a race condition in {{LeaderStateImpl#checkStaging}}:

{code:title=https://github.com/apache/ratis/blob/0d963e2ceec9045497bea1e4e2a939e84f36242a/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L813-L829}
      // check progress for the new followers
      final EnumSet<BootStrapProgress> reports = getLogAppenders()
          .map(LogAppender::getFollower)
          .filter(follower -> !isCaughtUp(follower))
          .map(follower -> checkProgress(follower, commitIndex))
          .collect(Collectors.toCollection(() -> 
EnumSet.noneOf(BootStrapProgress.class)));
      if (reports.contains(BootStrapProgress.NOPROGRESS)) {
        stagingState.fail(BootStrapProgress.NOPROGRESS);
      } else if (!reports.contains(BootStrapProgress.PROGRESSING)) {
        // all caught up!
        applyOldNewConf();
        getLogAppenders()
            .map(LogAppender::getFollower)
            .filter(f -> server.getRaftConf().containsInConf(f.getId()))
            .map(FollowerInfoImpl.class::cast)
            .forEach(FollowerInfoImpl::catchUp);
      }
{code}

Followers are collected/iterated twice:
 * check progress status
 * mark as having caught up

The race condition is between the thread executing {{checkStaging}} 
({{LeaderStateImpl}}), and the thread setting the stage and adding new 
followers in {{startSetConfiguration}} (which can be client thread):

{code:title=https://github.com/apache/ratis/blob/0d963e2ceec9045497bea1e4e2a939e84f36242a/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L503-L511}
    // set the staging state
    this.stagingState = configurationStagingState;

    if (newPeers.isEmpty() && newListeners.isEmpty()) {
      applyOldNewConf();
    } else {
      // update the LeaderState's sender list
      addAndStartSenders(newPeers);
      addAndStartSenders(newListeners);
{code}

If the follower is incorrectly marked as having caught up, it will not 
transition from starting to running when it receives {{appendEntries}}:

{code:title=bad (initializing=true)}
[omNode-bootstrap-1-server-thread2] DEBUG server.RaftServer$Division 
(RaftServerImpl.java:logAppendEntries(1504)) - 
omNode-bootstrap-1@group-0AAC5367B30E: receive appendEntries(omNode-1, 1, (t:1, 
i:0), 8, true, commits:[omNode-1:c8, omNode-bootstrap-1:c0], cId:1, entries: ...
{code}

{code:title=good (initializing=false)}
[omNode-bootstrap-1-server-thread1] DEBUG util.LifeCycle 
(LifeCycle.java:validate(116)) - omNode-bootstrap-1: STARTING -> RUNNING

[omNode-bootstrap-1-server-thread1] DEBUG server.RaftServer$Division 
(RaftServerImpl.java:logAppendEntries(1504)) - 
omNode-bootstrap-1@group-0AAC5367B30E: receive appendEntries(omNode-1, 1, (t:1, 
i:0), 8, false, commits:[omNode-1:c8, omNode-bootstrap-1:c0], cId:1, entries: 
...
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to