Ratis Server fails to step down when it loses the majority

ly Thu, 13 May 2021 03:26:06 -0700

Hi, I found that a Ratis leader does not call stepDown while it is continually 
receiving write requests, even after it loses its connection to the majority.
I guess the problem may be caused by a bug in EventProcessor.run(). Here is what 
the code looks like:


=====
while (running) {
  final StateUpdateEvent event = eventQueue.poll();
  synchronized(server) {
    if (running) {
      if (event != null) {
        event.execute();
      } else if (inStagingState()) {
        checkStaging();
      } else {
        yieldLeaderToHigherPriorityPeer();
        checkLeadership();
      }
    }
  }
}
=====

We can see that checkLeadership() is only reached when eventQueue.poll() returns 
null. As long as the eventQueue is not empty, the leader never calls 
checkLeadership() to verify that it is still the leader, and this could cause 
split-brain. Is that right? If so, I would like to create a JIRA issue and help 
fix it.
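
To illustrate the concern, here is a toy model of the loop above. It is not 
the actual Ratis code: the class EventLoopSketch, the method 
countLeadershipChecks and the alwaysCheck flag are made up for this example, and 
it assumes that continuous writes keep at least one event in the queue on every 
pass. Checking leadership after every event is only one possible direction for 
a fix, shown here to make the difference visible.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the event loop quoted above; hypothetical, not Ratis code.
class EventLoopSketch {

  /**
   * Runs the loop for a fixed number of passes while a simulated client
   * enqueues one event per pass, so poll() never returns null.
   * Returns how many times the stand-in for checkLeadership() ran.
   */
  static int countLeadershipChecks(boolean alwaysCheck, int iterations) {
    final Queue<Runnable> eventQueue = new ArrayDeque<>();
    int checks = 0;
    for (int i = 0; i < iterations; i++) {
      eventQueue.add(() -> { });            // simulated write keeps the queue busy
      final Runnable event = eventQueue.poll();
      if (event != null) {
        event.run();
        if (alwaysCheck) {
          checks++;                         // variant: also check after an event
        }
      } else {
        checks++;                           // original: check only when idle
      }
    }
    return checks;
  }

  public static void main(String[] args) {
    System.out.println("original loop: "
        + countLeadershipChecks(false, 1000));
    System.out.println("check after each event: "
        + countLeadershipChecks(true, 1000));
  }
}
```

In this model the original branch structure never reaches the leadership check 
while the queue stays busy, whereas the variant checks on every pass. A real 
fix would probably want to limit how often the check runs (for example, by 
elapsed time) rather than run it after every event.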




Sorry, I don't know where the best place to discuss this kind of problem is, so 
I may have sent it to several places.

Thanks,
ly
