Hi Snehasish,

In your scenario, once you kill n3, which is acting as a follower, the cluster has 3 non-listeners and 1 listener, with one follower already offline. At this point the majority is fragile: if any other non-listener goes down, the Raft group cannot form a quorum and elect a new leader.

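To make the majority arithmetic concrete, here is a minimal sketch in plain Java. It does not use any Ratis API: `QuorumSketch`, `majority`, and `canElectLeader` are hypothetical names chosen for illustration, and it assumes the voting configuration is still [n1, n2, n3] because the promotion of n4 has not taken effect yet.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Ratis: majority arithmetic over voting members only. */
class QuorumSketch {
  /** Smallest number of voting members that forms a majority. */
  static int majority(int votingMembers) {
    return votingMembers / 2 + 1;
  }

  /** True if enough voting members are online to win an election; listeners are never counted. */
  static boolean canElectLeader(List<String> voters, Set<String> online) {
    long onlineVoters = voters.stream().filter(online::contains).count();
    return onlineVoters >= majority(voters.size());
  }

  public static void main(String[] args) {
    // Assumption: n4 is still a listener, so it is online but has no vote.
    List<String> voters = List.of("n1", "n2", "n3");
    System.out.println(canElectLeader(voters, Set.of("n1", "n2", "n4"))); // n3 killed: true
    System.out.println(canElectLeader(voters, Set.of("n2", "n4")));       // n1 killed too: false
  }
}
```

With three voting members the majority is 2, so losing n3 is survivable, but losing n1 as well leaves only n2 with a vote.
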
Although you have requested the promotion of n4 from listener to follower and the removal of n3, the majority of the Raft group is still 2 until that request completes. With n3 already down, killing n1 leaves n2 as the only voting member online, so a new leader cannot be elected. In my understanding, this is not a bug; it is the expected behavior of the algorithm.

If you want to test how to safely promote a listener to a follower, make sure that the current leader and follower members keep a majority online until the promotion request completes (you can confirm this with shell commands, as sze suggested). Otherwise the promotion will not succeed. This is not a problem with the implementation but a boundary of the Raft algorithm.

Feel free to do more testing on this feature of Ratis. If you encounter either of the following issues, it would indicate that there is indeed a problem with the implementation, and we welcome discussions and contributions:

* Even with a majority of leader and follower members online, you still cannot promote a listener to a follower.
* In your case the member change failed because the majority was not maintained. But after you restart n1 or n3 and re-establish the majority, the Raft group still cannot elect a leader, or it elects a leader but fails to perform member changes.

We look forward to your testing.

Best
--------------
Xinyu Tan

On 2025/12/29 10:53:40 Snehasish Roy wrote:
> Hello everyone,
>
> Happy Holidays. This is my first email to this community so kindly excuse
> me for any mistakes.
>
> I initially started a 3 node Ratis Cluster and then added a listener in the
> Cluster using the setConfiguration(List.of(n1,n2,n3), List.of(n4)) based on
> the following documentation
> https://jojochuang.github.io/ratis-site/docs/developer-guide/listeners
>
> ```
> INFO [2025-12-29 15:57:01,887] [n1-server-thread1] [RaftServer$Division]:
> n1@group-ABB3109A44C2-LeaderStateImpl: startSetConfiguration
> SetConfigurationRequest:client-044D31187FB4->n1@group-ABB3109A44C2, cid=3,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[n1|0.0.0.0:9000, n2|
> 0.0.0.0:9001, n3|0.0.0.0:9002], listeners:[n4|0.0.0.0:9003]
> ```
>
> Then I killed one of the Ratis follower node (n3) followed by promoting the
> listener to the follower using setConfiguration(List.of(n1,n2,n4)) command
> to maintain the cluster size of 3.
> Please note that n3 has been removed from the list of followers and there
> are no more listeners in the cluster and there were no failures observed
> while issuing the command.
>
> ```
> INFO [2025-12-29 16:02:54,227] [n1-server-thread2] [RaftServer$Division]:
> n1@group-ABB3109A44C2-LeaderStateImpl: startSetConfiguration
> SetConfigurationRequest:client-2438CA24E2F3->n1@group-ABB3109A44C2, cid=4,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[n1|0.0.0.0:9000, n2|
> 0.0.0.0:9001, n4|0.0.0.0:9003], listeners:[]
> ```
>
> Then I killed the leader instance n1. Post which n2 attempted to become a
> leader and starts asking for votes from n1 and n4. There is no response
> from n1 as it's not alive and n4 is rejecting the pre_vote request from n2
> because it still thinks it's a listener.
>
> Logs from n2
> ```
> INFO [2025-12-29 16:04:10,051] [n2@group-ABB3109A44C2-LeaderElection30]
> [LeaderElection]: n2@group-ABB3109A44C2-LeaderElection30 PRE_VOTE round 0:
> submit vote requests at term 1 for conf: {index: 15, cur=peers:[n1|
> 0.0.0.0:9000, n2|0.0.0.0:9001, n4|0.0.0.0:9003]|listeners:[], old=null}
> INFO [2025-12-29 16:04:10,052] [n2@group-ABB3109A44C2-LeaderElection30]
> [LeaderElection]: n2@group-ABB3109A44C2-LeaderElection30 got exception when
> requesting votes: java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> INFO [2025-12-29 16:04:10,054] [n2@group-ABB3109A44C2-LeaderElection30]
> [LeaderElection]: n2@group-ABB3109A44C2-LeaderElection30: PRE_VOTE REJECTED
> received 1 response(s) and 1 exception(s):
> INFO [2025-12-29 16:04:10,054] [n2@group-ABB3109A44C2-LeaderElection30]
> [LeaderElection]: Response 0: n2<-n4#0:FAIL-t1-last:(t:1, i:16)
> INFO [2025-12-29 16:04:10,054] [n2@group-ABB3109A44C2-LeaderElection30]
> [LeaderElection]: Exception 1: java.util.concurrent.ExecutionException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> ```
>
>
> Due to lack of leader, the cluster is no more stable.
>
> Logs from n4
> ```
> INFO [2025-12-29 16:05:03,405] [grpc-default-executor-2]
> [RaftServer$Division]: n4@group-ABB3109A44C2: receive requestVote(PRE_VOTE,
> n2, group-ABB3109A44C2, 1, (t:1, i:16))
> INFO [2025-12-29 16:05:03,405] [grpc-default-executor-2] [VoteContext]:
> n4@group-ABB3109A44C2-LISTENER: reject PRE_VOTE from n2: this server is a
> listener, who is a non-voting member
> INFO [2025-12-29 16:05:03,405] [grpc-default-executor-2]
> [RaftServer$Division]: n4@group-ABB3109A44C2 replies to PRE_VOTE vote
> request: n2<-n4#0:FAIL-t1-last:(t:1, i:16). Peer's state:
> n4@group-ABB3109A44C2:t1, leader=n1, voted=null,
> raftlog=Memoized:n4@group-ABB3109A44C2-SegmentedRaftLog:OPENED:c16:last(t:1,
> i:16), conf=conf: {index: 15, cur=peers:[n1|0.0.0.0:9000, n2|0.0.0.0:9001,
> n4|0.0.0.0:9003]|listeners:[], old=null}
> ```
>
> So my question is how to correctly promote a listener to a follower? Did I
> miss some step? Or is there a bug in the code? If it's the latter, I would
> be happy to contribute. Please let me know if you need any more debugging
> information.
>
> Thank you again for looking into this issue.
>
>
> Regards,
> Snehasish
>
