Hi, I found that a Ratis leader does not call stepDown while it is continually
receiving write requests, even though it has lost its connection with the majority.
I guess the problem may be caused by a bug in EventProcessor.run(). Here is what
the code looks like:
=====
while (running) {
  final StateUpdateEvent event = eventQueue.poll();
  synchronized(server) {
    if (running) {
      if (event != null) {
        event.execute();
      } else if (inStagingState()) {
        checkStaging();
      } else {
        yieldLeaderToHigherPriorityPeer();
        checkLeadership();
      }
    }
  }
}
=====
We can see that as long as the eventQueue is not empty, the loop always takes
the event.execute() branch, so the leader never reaches checkLeadership() to
verify that it is still the leader. This can cause split-brain. Is that right?
If so, I would like to create a JIRA and help fix it.
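To illustrate what I mean, here is a minimal standalone sketch. It is not Ratis
code: the class StarvationSketch, the method countLeadershipChecks and the
checkEvery parameter are all made up for illustration, and a simple counter
stands in for checkLeadership(). It only shows that a loop of this shape never
reaches the else branch while an event is always available, and that forcing a
check every N rounds (one possible direction for a fix, not necessarily the
right one for Ratis) avoids the starvation.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-alone model of the loop above; not actual Ratis code.
public class StarvationSketch {

    /**
     * Runs `iterations` rounds of a simplified event loop in which one new
     * event arrives every round, so poll() never returns null.
     * Returns how many times the stand-in leadership check ran.
     *
     * checkEvery <= 0: same shape as the quoted loop (check only when idle).
     * checkEvery > 0 : assumed fix, also check on every checkEvery-th round.
     */
    public static int countLeadershipChecks(int iterations, int checkEvery) {
        Queue<Runnable> eventQueue = new ArrayDeque<>();
        int checks = 0;
        for (int i = 1; i <= iterations; i++) {
            eventQueue.add(() -> { });          // continuous write traffic
            Runnable event = eventQueue.poll();
            boolean checkDue = checkEvery > 0 && i % checkEvery == 0;
            if (event != null) {
                event.run();                    // stands in for event.execute()
            }
            if (event == null || checkDue) {
                checks++;                       // stands in for checkLeadership()
            }
        }
        return checks;
    }

    public static void main(String[] args) {
        System.out.println("check only when idle: " + countLeadershipChecks(1000, 0));
        System.out.println("check every 100 rounds: " + countLeadershipChecks(1000, 100));
    }
}
```

With 1000 rounds of constant traffic, the first call returns 0 checks and the
second returns 10. A real fix would more likely be based on elapsed time than
on a round count, but the idea is the same.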
Sorry, I don't know the best place to discuss this kind of problem, so
I may send this to several places.
Thanks,
ly