[
https://issues.apache.org/jira/browse/RATIS-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze updated RATIS-794:
-----------------------------
Attachment: r794_20200124.patch
> Ratils leader should retry append requests based on follower commit info in
> case of intermittent append failures
> ----------------------------------------------------------------------------------------------------------------
>
> Key: RATIS-794
> URL: https://issues.apache.org/jira/browse/RATIS-794
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Shashikant Banerjee
> Assignee: Tsz-wo Sze
> Priority: Major
> Fix For: 0.5.0
>
> Attachments: r794_20200122.patch, r794_20200124.patch
>
>
> During Ozone testing, it was observed that a leader election happens in
> between the test , where a follower has caught to a certain index 313. The
> new leader starts sends an append request to the follower which fails with
> grpc Exception. This leads to leader reset the connection and start from the
> beginning (index 1).
>
>
> {code:java}
> 2020-01-13 14:56:32,995 INFO org.apache.ratis.server.impl.RaftServerImpl:
> 0.0.0.0:9858@group-4F125BF42C14: changes role from CANDIDATE to LEADER at
> term 7 for changeToLeader
> 2020-01-13 14:56:32,995 INFO
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> Leader change notification received for group: group-4F125BF42C14 with new
> leaderId: ed90869c-317e-4303-8922-9fa83a3983cb
> 2020-01-13 14:56:33,042 WARN org.apache.ratis.grpc.server.GrpcLogAppender:
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler:
> Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2020-01-13 14:56:33,043 DEBUG org.apache.ratis.util.PeerProxyMap:
> ed90869c-317e-4303-8922-9fa83a3983cb: reset proxy for
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle:
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: RUNNING -> CLOSING
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle:
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: CLOSING -> CLOSED
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle:
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: NEW
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.TimeoutScheduler: new
> ScheduledThreadPoolExecutor
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.PeerProxyMap:
> ed90869c-317e-4303-8922-9fa83a3983cb: Closing proxy for peer
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858
> 2020-01-13 14:56:33,045 DEBUG org.apache.ratis.util.TimeoutScheduler:
> schedule a task: timeout 6000ms, sid 1
> 2020-01-13 14:56:33,047 INFO org.apache.ratis.server.impl.FollowerInfo:
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex:
> updateUnconditionally 314 -> 1 ---------------------> set the next index for
> the follower back to 1 and starts from 1)
> 2020-01-13 14:56:35,840 DEBUG org.apache.ratis.grpc.server.GrpcLogAppender:
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler:
> received the first reply
> ed90869c-317e-4303-8922-9fa83a3983cb<-b65b0b6c-b0bb-429f-a23d-467c72d4b85c#2:OK,SUCCESS,nextIndex:314,term:5,followerCommit:313,
> request=AppendEntriesRequest:cid=2,entriesCount=0,lastEntry=null .
> -------------------> (Receives the response from follower indficating
> follower is at 312)
> Although the follower is at 313, the leader keeps on sending the
> appendRequests from index 1.
> 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.server.impl.FollowerInfo:
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex:
> updateIncreasingly 1 -> 2
> 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.util.TimeoutScheduler:
> schedule a task: timeout 6000ms, sid 7
> 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.server.impl.FollowerInfo:
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex:
> updateIncreasingly 2 -> 3
> 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.util.TimeoutScheduler:
> schedule a task: timeout 6000ms, sid 8
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)