[ https://issues.apache.org/jira/browse/RATIS-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765711#comment-17765711 ]
Tsz-wo Sze commented on RATIS-1048:
-----------------------------------
[~burcukozkan], thanks a lot for testing Ratis! I somehow overlooked this
issue.
According to the logs below, s2 did become the leader at one point: s2 elected
itself at term 2, and s0 and s1 then accepted it as leader through
appendEntries. However, I am not sure why the test failed at the end.
{code:java}
2020-08-30 21:11:56,836 INFO impl.RaftServerImpl
(ServerState.java:setLeader(255)) - s2@group-60D9F845C708: change Leader from
null to s2 at term 2 for becomeLeader, leader elected after 724ms
...
2020-08-30 21:11:56,870 INFO impl.RoleInfo (RoleInfo.java:updateAndGet(143)) -
s2: start LeaderState
...
2020-08-30 21:11:56,913 INFO impl.RaftServerImpl
(ServerState.java:setLeader(255)) - s1@group-60D9F845C708: change Leader from
null to s2 at term 2 for appendEntries, leader elected after 801ms
2020-08-30 21:11:56,914 INFO impl.RaftServerImpl
(ServerState.java:setLeader(255)) - s0@group-60D9F845C708: change Leader from
null to s2 at term 2 for appendEntries, leader elected after 802ms
...
2020-08-30 21:11:57,236 INFO ratis.RaftTestUtil
(RaftTestUtil.java:waitForLeader(104)) - printing ALL groups
s0: RUNNING FOLLOWER s0@group-60D9F845C708:t2, leader=s2, voted=s2,
raftlog=s0@group-60D9F845C708-SegmentedRaftLog:OPENED:c0,f0,i0, conf=0:
[s0:0.0.0.0:44625, s1:0.0.0.0:43949, s2:0.0.0.0:42577], old=null RUNNING
s1: RUNNING FOLLOWER s1@group-60D9F845C708:t2, leader=s2, voted=s2,
raftlog=s1@group-60D9F845C708-SegmentedRaftLog:OPENED:c0,f0,i0, conf=0:
[s0:0.0.0.0:44625, s1:0.0.0.0:43949, s2:0.0.0.0:42577], old=null RUNNING
s2: RUNNING LEADER s2@group-60D9F845C708:t2, leader=s2, voted=s2,
raftlog=s2@group-60D9F845C708-SegmentedRaftLog:OPENED:c0,f0,i0, conf=0:
[s0:0.0.0.0:44625, s1:0.0.0.0:43949, s2:0.0.0.0:42577], old=null RUNNING
{code}
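To check the attached logs.txt for the same pattern, the leader changes could be extracted with a short script. This is only a minimal sketch: the `leader_changes` helper and its regular expression are hypothetical, not part of Ratis, and they assume the entries follow the "change Leader from ... to ... at term ..." wording shown above.

```python
import re

# Matches entries such as:
#   s1@group-60D9F845C708: change Leader from null to s2 at term 2 for appendEntries
# The pattern is an assumption based on the log excerpt, not a Ratis-defined format.
LEADER_CHANGE = re.compile(
    r"(?P<server>\w+)@(?P<group>group-[0-9A-Fa-f]+): change Leader from "
    r"(?P<old>\w+) to (?P<new>\w+) at term (?P<term>\d+) for (?P<reason>\w+)"
)


def leader_changes(log_text):
    """Return (server, old_leader, new_leader, term, reason) tuples in log order."""
    # Long entries are wrapped across lines in the excerpt, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    return [
        (m["server"], m["old"], m["new"], int(m["term"]), m["reason"])
        for m in LEADER_CHANGE.finditer(flat)
    ]
```

Each tuple gives the server that logged the change, the old and new leader, the term, and the reason, so a run in which no server ever reports a new leader shows up as an empty list.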
> Ratis cluster fails to elect a leader when some messages are dropped
> --------------------------------------------------------------------
>
> Key: RATIS-1048
> URL: https://issues.apache.org/jira/browse/RATIS-1048
> Project: Ratis
> Issue Type: Bug
> Reporter: Burcu Ozkan
> Priority: Major
> Attachments: logs.txt, msgs.txt
>
>
> I am testing the fault tolerance of Ratis, more specifically whether it can
> tolerate random message losses. Concretely, I drop some of the messages and
> do not deliver them to the recipient.
> In some tests, I observe executions in which the Ratis servers cannot elect a
> leader. The servers repeatedly start leader elections, but none of them
> succeeds.
> You can find the execution logs together with the list of exchanged and
> dropped messages (the messages marked by "-D" are dropped) in the attachments.