[ https://issues.apache.org/jira/browse/HDDS-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699459#comment-17699459 ]

Ivan Andika commented on HDDS-8131:
-----------------------------------

[~Nibiruxu] Thank you for following up on this.

I agree that within a single state machine, the snapshotIndex is usually (if 
not always) smaller than the commitIndex.

However, if there is a late follower, its commitIndex can be lower than the 
leader's snapshotIndex. If the leader purges up to its snapshotIndex, there is a 
chance that the leader's snapshotIndex is larger than both the commitIndex and 
the nextIndex of the late follower. Hence, there is a gap where the logs have 
been purged by the leader but have not yet been replicated to the follower's 
log (i.e. {{followerNextIndex < logStartIndex}} in 
{{LogAppender#shouldInstallSnapshot()}}). The late follower is then forced to 
install the snapshot from the leader, since it has not committed the logs 
between its own commitIndex and the leader's snapshotIndex.
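
To make the condition concrete, here is a minimal Java sketch of the check 
described above. It is only illustrative (the class and method shown are mine); 
the actual logic lives in Ratis's {{LogAppender#shouldInstallSnapshot()}}:

{code:java}
// Illustrative sketch only -- NOT the actual Ratis implementation.
public final class InstallSnapshotCheck {

  // The leader must fall back to snapshot installation when the follower's
  // next expected log index precedes the leader's first retained log entry,
  // i.e. the entries in between have already been purged.
  static boolean shouldInstallSnapshot(long followerNextIndex, long logStartIndex) {
    return followerNextIndex < logStartIndex;
  }

  public static void main(String[] args) {
    // Leader purged up to snapshotIndex = 1_500_000, so its log starts at 1_500_001.
    // The late follower only replicated up to 1_000_000, so its next index is 1_000_001.
    System.out.println(shouldInstallSnapshot(1_000_001, 1_500_001)); // true -> install snapshot
  }
}
{code}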

I will try to demonstrate using an example of a 3-node Raft cluster:

Let's say {{raft.server.log.purge.gap}} = 1_000_000 and 
{{ozone.om.ratis.snapshot.auto.trigger.threshold}} = 400_000
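
For context, here is my understanding of how the purge gap factors into these 
numbers (a simplified sketch of the {{raft.server.log.purge.gap}} semantics, 
not the actual Ratis code):

{code:java}
// Simplified sketch of the purge-gap trigger -- my reading of the
// raft.server.log.purge.gap semantics, not the actual Ratis code.
public final class PurgeGapCheck {

  // Purging is only attempted once the candidate purge index has advanced
  // at least purgeGap entries past the last purge.
  static boolean shouldAttemptPurge(long candidatePurgeIndex, long lastPurgeIndex, long purgeGap) {
    return candidatePurgeIndex - lastPurgeIndex >= purgeGap;
  }

  public static void main(String[] args) {
    long purgeGap = 1_000_000;      // raft.server.log.purge.gap
    long lastPurgeIndex = 400_000;  // as in the example below
    long snapshotIndex = 1_500_000; // candidate purge index when purging up to the snapshot
    System.out.println(shouldAttemptPurge(snapshotIndex, lastPurgeIndex, purgeGap)); // true
  }
}
{code}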

 

*With purgeUpToSnapshotIndex enabled:*

Leader: CommitIndex = 2_000_000, SnapshotIndex = 1_500_000, lastPurgeIndex = 
400_000
 * If the leader triggers a snapshot, the leader will purge up to SnapshotIndex
 * So it only retains logs from index 1_500_001 onwards

Up-to-date follower: CommitIndex = 2_000_000
 * This follower should not need to install a snapshot from the leader, since 
it is up-to-date and its commitIndex = 2_000_000 > leader's snapshotIndex = 
1_500_000
 ** All the logs up to commitIndex = 2_000_000 have been replicated to the 
follower's log

Late follower: CommitIndex = 1_000_000 ({*}possible since the majority of the 
Raft cluster's commitIndexes are up-to-date{*}), followerNextIndex (from the 
leader's perspective) = 1_000_001
 * This follower is forced to install the snapshot from the leader, since the 
follower doesn't have logs 1_000_001 to 1_500_000, but the leader has purged 
these logs up to SnapshotIndex (1_500_000); see the sketch after this list
 ** Checked by the leader's {{LogAppender#shouldInstallSnapshot()}}
 ** The leader detects that it cannot use normal log replication to send 
entries to the follower, since the logs have been purged from the leader
 ** The leader will need to notify the late follower to install the snapshot by 
downloading the large OM metadata from the leader ({*}this is the case we want 
to avoid{*})
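
Putting the numbers from this scenario into a runnable form (illustrative only):

{code:java}
// Sketch of the *enabled* scenario above, with illustrative numbers.
public final class PurgeUpToSnapshotEnabled {
  public static void main(String[] args) {
    long leaderSnapshotIndex = 1_500_000;
    // With purge-up-to-snapshot-index enabled, the leader purges up to its
    // snapshot index regardless of follower progress.
    long logStartIndex = leaderSnapshotIndex + 1; // leader keeps 1_500_001 onwards

    long upToDateFollowerNextIndex = 2_000_001; // commitIndex = 2_000_000
    long lateFollowerNextIndex = 1_000_001;     // commitIndex = 1_000_000

    // followerNextIndex < logStartIndex  =>  snapshot installation required
    System.out.println(upToDateFollowerNextIndex < logStartIndex); // false -> normal replication
    System.out.println(lateFollowerNextIndex < logStartIndex);     // true  -> install snapshot
  }
}
{code}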

 

*Now with purgeUpToSnapshotIndex disabled:*

Leader: CommitIndex = 2_000_000, SnapshotIndex = 1_500_000, lastPurgeIndex = 
400_000
 * If it purges now, it will take the min of (its snapshotIndex, the 
commitIndex of all peers)
 * min(1_500_000, 2_000_000, 2_000_000, 1_000_000) = 1_000_000
 * So it is only allowed to purge up to index 1_000_000, since that is the 
lowest commitIndex among all the peers

Up-to-date follower: CommitIndex = 2_000_000
 * Similar to the previous example.

Late follower: CommitIndex = 1_000_000 ({*}valid since the majority of peers' 
commitIndexes are up-to-date{*}), followerNextIndex (from the leader's 
perspective) = 1_000_001
 * This follower does not need to install the snapshot from the leader, since 
the leader has not purged the logs from 1_000_001 onwards; see the sketch after 
this list
 ** Since {{followerNextIndex >= logStartIndex}}, the leader will not notify 
the follower to install the snapshot from the leader
 ** The leader can just use normal log replication to make the follower catch 
up with the leader
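
The same numbers, with the purge upper bound capped by the slowest peer (again 
an illustrative sketch rather than the actual Ratis purge code):

{code:java}
import java.util.stream.LongStream;

// Sketch of the *disabled* scenario above: the purge upper bound is capped
// by the slowest peer's commitIndex. Illustrative only.
public final class PurgeUpToSnapshotDisabled {
  public static void main(String[] args) {
    long leaderSnapshotIndex = 1_500_000;
    long[] peerCommitIndexes = {2_000_000, 2_000_000, 1_000_000}; // leader + 2 followers

    long purgeUpperBound = Math.min(leaderSnapshotIndex,
        LongStream.of(peerCommitIndexes).min().getAsLong());
    System.out.println(purgeUpperBound); // 1_000_000

    long logStartIndex = purgeUpperBound + 1; // leader keeps 1_000_001 onwards
    long lateFollowerNextIndex = 1_000_001;
    System.out.println(lateFollowerNextIndex < logStartIndex); // false -> normal replication suffices
  }
}
{code}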

Therefore, under normal conditions, there should not be a case where a late 
follower needs to install a snapshot from the leader.

This might not be a perfect example, but hopefully it clarifies things. Please 
let me know if you have any questions or counter-examples.
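
As an aside, regarding {{raft.server.log.purge.preservation.log.num}} from the 
issue description below: a rough sketch of how preserving the latest n logs 
could cap the purge index (my reading of RATIS-1626; the helper below is 
hypothetical):

{code:java}
// Hypothetical helper illustrating my reading of RATIS-1626: preserve the
// latest n entries, so never purge past lastLogIndex - n, even when purging
// up to the snapshot index is enabled.
public final class PurgePreservationSketch {
  static long purgeUpperBound(long snapshotIndex, long lastLogIndex, long preservationLogNum) {
    return Math.min(snapshotIndex, lastLogIndex - preservationLogNum);
  }

  public static void main(String[] args) {
    // With n = 600_000 preserved entries, the leader retains logs from
    // 1_400_001 even though its snapshotIndex is 1_500_000.
    System.out.println(purgeUpperBound(1_500_000, 2_000_000, 600_000)); // 1_400_000
  }
}
{code}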

> Add Configuration for OM Ratis Log Purge Tuning Parameters
> ----------------------------------------------------------
>
>                 Key: HDDS-8131
>                 URL: https://issues.apache.org/jira/browse/HDDS-8131
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Manager
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.3.0
>
>
> Currently Ozone Manager enables {{raft.server.log.purge.upto.snapshot.index}} 
> by default.
> However, for an OM cluster with a large metadata store, there might be a case 
> where the OM leader purges its Ratis logs before a slow follower has 
> replicated them to its log. This means that the follower needs to download 
> the whole metadata store from the OM leader. This can be problematic if the 
> metadata store on the leader is too large.
> We should add two configurations in OM to control the following Ratis purge 
> parameters:
>  * {{raft.server.log.purge.upto.snapshot.index}}
>  ** Disabling this would guarantee that the OM leader will not purge its 
> Ratis log unless all the logs have been replicated to all the followers 
> (tracked through {{commitIndex}}).
>  ** This would effectively mean that there shouldn't be a case where a slow 
> follower needs to download the full metadata from the leader, so no snapshot 
> download by the follower. Note that for small OM metadata, it can be faster 
> for the follower to download the leader's metadata snapshot than to replicate 
> and apply the outstanding logs normally.
>  ** For a very slow or downed follower, the OM leader cannot purge the log 
> until the follower catches up to it. This might increase the disk space usage 
> on the OM leader.
>  ** Default would be {{true}} to preserve the current OM snapshot behavior
>  * {{raft.server.log.purge.preservation.log.num}}
>  ** RATIS-1626 introduces logic to preserve the latest n logs from being 
> purged
>  ** Setting n > 0 while still enabling 
> {{raft.server.log.purge.upto.snapshot.index}} should strike a balance between 
> the cost of preserving & transferring logs and the cost of transferring a 
> snapshot.
>  ** Default would be 0 to preserve the current OM snapshot behavior


