[ 
https://issues.apache.org/jira/browse/RATIS-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022039#comment-17022039
 ] 

Shashikant Banerjee commented on RATIS-794:
-------------------------------------------

Thanks [~szetszwo] for the patch. The patch needs to be rebased. Can you please 
check?

> Ratis leader should retry append requests based on follower commit info in 
> case of intermittent append failures
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: RATIS-794
>                 URL: https://issues.apache.org/jira/browse/RATIS-794
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Shashikant Banerjee
>            Assignee: Tsz-wo Sze
>            Priority: Major
>             Fix For: 0.5.0
>
>         Attachments: r794_20200122.patch
>
>
> During Ozone testing, a leader election was observed in the middle of a 
> test, after a follower had caught up to index 313. The new leader sends 
> an append request to the follower, which fails with a gRPC exception. This 
> causes the leader to reset the connection and restart from the beginning 
> (index 1). 
>  
>  
> {code:java}
> 2020-01-13 14:56:32,995 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 0.0.0.0:9858@group-4F125BF42C14: changes role from CANDIDATE to LEADER at 
> term 7 for changeToLeader
> 2020-01-13 14:56:32,995 INFO 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  Leader change notification received for group: group-4F125BF42C14 with new 
> leaderId: ed90869c-317e-4303-8922-9fa83a3983cb
> 2020-01-13 14:56:33,042 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler:
>  Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2020-01-13 14:56:33,043 DEBUG org.apache.ratis.util.PeerProxyMap: 
> ed90869c-317e-4303-8922-9fa83a3983cb: reset proxy for 
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: 
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: RUNNING -> CLOSING
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: 
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: CLOSING -> CLOSED
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: 
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: NEW
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.TimeoutScheduler: new 
> ScheduledThreadPoolExecutor
> 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.PeerProxyMap: 
> ed90869c-317e-4303-8922-9fa83a3983cb: Closing proxy for peer 
> b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858
> 2020-01-13 14:56:33,045 DEBUG org.apache.ratis.util.TimeoutScheduler: 
> schedule a task: timeout 6000ms, sid 1 
> 2020-01-13 14:56:33,047 INFO org.apache.ratis.server.impl.FollowerInfo: 
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: 
> updateUnconditionally 314 -> 1 ---------------------> (sets the next index for 
> the follower back to 1 and starts from 1)
> 2020-01-13 14:56:35,840 DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler:
>  received the first reply 
> ed90869c-317e-4303-8922-9fa83a3983cb<-b65b0b6c-b0bb-429f-a23d-467c72d4b85c#2:OK,SUCCESS,nextIndex:314,term:5,followerCommit:313,
>  request=AppendEntriesRequest:cid=2,entriesCount=0,lastEntry=null .  
> -------------------> (Receives the response from the follower indicating 
> that the follower is at 313)
> Although the follower is at 313, the leader keeps sending 
> appendRequests from index 1. 
> 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.server.impl.FollowerInfo: 
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: 
> updateIncreasingly 1 -> 2
> 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.util.TimeoutScheduler: 
> schedule a task: timeout 6000ms, sid 7
> 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.server.impl.FollowerInfo: 
> 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: 
> updateIncreasingly 2 -> 3
> 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.util.TimeoutScheduler: 
> schedule a task: timeout 6000ms, sid 8
> {code}
>  
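
The behaviour requested in the issue title can be illustrated with a minimal Java sketch. This is not the attached patch and not the Ratis API: the class `NextIndexSketch`, the method `nextIndexAfterReset`, and its parameters are hypothetical names chosen for illustration. It assumes the leader knows the follower's last reported commit index (for example `followerCommit:313` in the reply above) and, after an intermittent append failure, resumes from the entry after that index instead of resetting `nextIndex` to 1.

```java
/**
 * Illustrative sketch only; all names here are hypothetical and are not
 * taken from Ratis or from the attached patch.
 */
public final class NextIndexSketch {
    private NextIndexSketch() {}

    /**
     * Chooses the index to resume appending from after a connection reset.
     *
     * @param followerCommit  last commit index reported by the follower,
     *                        or a negative value if none is known (assumption)
     * @param leaderNextIndex index the leader would append next to its own log
     * @param firstIndex      first index in the leader's log (1 in the log above)
     */
    public static long nextIndexAfterReset(long followerCommit,
                                           long leaderNextIndex,
                                           long firstIndex) {
        if (followerCommit < firstIndex) {
            // No usable commit info: fall back to the start of the log.
            return firstIndex;
        }
        // Resume right after the follower's commit index, but never
        // beyond what the leader itself has.
        return Math.min(followerCommit + 1, leaderNextIndex);
    }

    public static void main(String[] args) {
        // Follower reported commit index 313; the leader's own next index
        // (320) is an arbitrary illustrative value.
        System.out.println(nextIndexAfterReset(313, 320, 1));
        // No commit info known: restart from the first index.
        System.out.println(nextIndexAfterReset(-1, 320, 1));
    }
}
```

With the values from the log, the sketch would resume at index 314 rather than replaying entries 1 through 313.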



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
