[
https://issues.apache.org/jira/browse/RATIS-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze updated RATIS-1707:
------------------------------
Component/s: server
> Fix corner case when getPrevious in LogAppenderBase
> ---------------------------------------------------
>
> Key: RATIS-1707
> URL: https://issues.apache.org/jira/browse/RATIS-1707
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Xu Shao Hong
> Assignee: Xu Shao Hong
> Priority: Major
> Attachments: 745_review.patch, ratis.png
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> In the cluster, we found many {*}Failed appendEntries{*} exceptions, for
> example:
> _Unexpected Index: previous is null but entries[0].getIndex()=3402406. So the
> follower cannot append an entry due to some limit._
> The scenario:
> Follower B restarted. Leader A sent an entry, but B could not find the
> previous log entry. A then sent a notifyInstallSnapshot request to B, and B
> found that its next index was larger than the leader's
> firstAvailableLogIndex (the index used to install the snapshot). A updated
> B's index according to the reply and sent the entries to B.
> A looks up the TermIndex of the previous entry through
> ``getPrevious(long nextIndex)``. If B's nextIndex is exactly the same as the
> startIndex of leader A's raft log (B needs the entries starting from A's
> firstAvailableLogIndex), the entry at nextIndex - 1 has already been purged
> from A's raft log. A then checks whether the index of the snapshot returned
> by +server.getStateMachine().getLatestSnapshot()+ equals nextIndex - 1; if
> it does not, getPrevious returns null.
> The problem is caused by the uncertainty of raft log purging. If A has also
> been stopped, which triggers takeSnapshot, the raft log may not have been
> purged up to the snapshot index. The latest snapshot index from the state
> machine (SM) then no longer lines up with the raft log's first available
> index, which leads to this corner case.
> We could add a check for this case in getPrevious.
>
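The lookup described above can be sketched roughly as follows. This is a simplified, self-contained model and not the actual Ratis code: the class `PreviousLookupSketch`, its nested `TermIndex` and the helper `isPurgeSnapshotMismatch` are hypothetical names introduced only for illustration, and the sketch only detects the corner case without deciding how the leader should handle it.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified model of the leader-side lookup; not the Ratis classes. */
final class PreviousLookupSketch {
  /** Minimal stand-in for a (term, index) pair. */
  static final class TermIndex {
    final long term;
    final long index;

    TermIndex(long term, long index) {
      this.term = term;
      this.index = index;
    }
  }

  /** Leader log entries that survived purging: index -> term. */
  private final TreeMap<Long, Long> log = new TreeMap<>();
  /** Latest snapshot reported by the state machine; may be null. */
  private final TermIndex latestSnapshot;

  PreviousLookupSketch(Map<Long, Long> entries, TermIndex latestSnapshot) {
    this.log.putAll(entries);
    this.latestSnapshot = latestSnapshot;
  }

  /** Returns the TermIndex just before nextIndex, or null if it cannot be resolved. */
  TermIndex getPrevious(long nextIndex) {
    final long previousIndex = nextIndex - 1;
    final Long term = log.get(previousIndex);
    if (term != null) {
      return new TermIndex(term, previousIndex); // still present in the log
    }
    if (latestSnapshot != null && latestSnapshot.index == previousIndex) {
      return latestSnapshot; // purged, but the snapshot ends exactly there
    }
    return null; // purged and the snapshot index does not line up
  }

  /**
   * Assumed detection of the corner case: nextIndex is the first index left
   * in the log, but the latest snapshot does not end at nextIndex - 1.
   */
  boolean isPurgeSnapshotMismatch(long nextIndex) {
    return !log.isEmpty()
        && log.firstKey() == nextIndex
        && latestSnapshot != null
        && latestSnapshot.index != nextIndex - 1;
  }
}
```

With a log holding indices 100 to 105 and a snapshot taken at index 103, `getPrevious(100)` returns null in this model and `isPurgeSnapshotMismatch(100)` is true, which mirrors the "previous is null" failure reported above.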
--
This message was sent by Atlassian Jira
(v8.20.10#820010)