[
https://issues.apache.org/jira/browse/HDDS-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699459#comment-17699459
]
Ivan Andika commented on HDDS-8131:
-----------------------------------
[~Nibiruxu] Thank you for following up on this.
I agree that within a single state machine, snapshotIndex is usually (or
always) smaller than its commitIndex.
However, if there is a late follower, its commitIndex can be lower than the
leader's snapshotIndex. If the leader purges up to its snapshotIndex, the
leader's snapshotIndex may be larger than both the commitIndex and the
nextIndex of the late follower. This leaves a gap of log entries that the
leader has purged but that have not been replicated to the follower's log (i.e.
{{followerNextIndex < logStartIndex}} in
{{LogAppender#shouldInstallSnapshot()}}). The late follower is then forced
to install the snapshot from the leader since it does not have the log entries
between its own commitIndex and the leader's snapshotIndex.
I will try to demonstrate this with an example of a 3-node Raft cluster.
Let's say raft.server.log.purge.gap = 1_000_000 and
ozone.om.ratis.snapshot.auto.trigger.threshold = 400_000.
*With purgeUpToSnapshotIndex enabled:*
Leader: CommitIndex = 2_000_000, SnapshotIndex = 1_500_000, lastPurgeIndex =
400_000
* If the leader triggers a snapshot, the leader will purge up to SnapshotIndex
* So it only has logs from 1_500_000+ onwards
Up-to-date follower: CommitIndex = 2_000_000
* This follower should not need to install a snapshot from the leader since it
is up-to-date and its commitIndex = 2_000_000 > leader's snapshotIndex =
1_500_000
** All the logs up to commitIndex = 2_000_000 have been replicated to the
follower's log
Late follower: CommitIndex = 1_000_000 (*possible since the majority of the Raft
cluster's commitIndex is up-to-date*), followerNextIndex (from Leader's
perspective) = 1_000_001
* This follower is forced to install the snapshot from the leader since the
follower doesn't have logs 1_000_001 to 1_500_000, but the leader has purged
these logs up to SnapshotIndex (1_500_000)
** Checked by Leader's {{LogAppender#shouldInstallSnapshot()}}
** The leader detects that it cannot use normal log replication to send to the
follower since the logs have been purged from the leader
** The leader will need to notify the late follower to install the snapshot by
downloading the large OM metadata from the leader (*we want to avoid this
case*)
*Now with purgeUpToSnapshotIndex disabled:*
Leader: CommitIndex = 2_000_000, SnapshotIndex = 1_500_000, lastPurgeIndex =
400_000
* If it purges now, it will take the min of (its snapshotIndex, commitIndex of
all peers)
* min(1_500_000, 2_000_000, 2_000_000, 1_000_000) = 1_000_000
* So it is only allowed to purge up to index 1_000_000 since it's the lowest
commitIndex out of all the peers
Up-to-date follower: CommitIndex = 2_000_000
* Similar to the previous example.
Late follower: CommitIndex = 1_000_000 (*valid since the majority of peers'
commitIndex is up-to-date*), followerNextIndex (from Leader's perspective) =
1_000_001
* This follower does not need to install the snapshot from the leader since the
leader has not purged the logs from 1_000_001 onwards
** Since {{followerNextIndex >= logStartIndex}}, the leader will not
notify the follower to install the snapshot from the leader
** The leader can just use the normal log replication to make the follower
catch up with the leader
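The two scenarios above can be replayed with a small Python sketch. It is only a model of this example, under stated assumptions: the function names are hypothetical, a purge is assumed to remove every entry up to and including the purge index (so the leader's log then starts at purge index + 1), and raft.server.log.purge.gap is ignored.

```python
from typing import List


def purge_index(snapshot_index: int,
                peer_commit_indices: List[int],
                purge_upto_snapshot_index: bool) -> int:
    """Highest log index the leader may purge (simplified model)."""
    if purge_upto_snapshot_index:
        # Enabled: purge up to the leader's own snapshot index.
        return snapshot_index
    # Disabled: never purge past the lowest commitIndex among all peers.
    return min([snapshot_index, *peer_commit_indices])


def needs_snapshot(follower_next_index: int, purged_upto: int) -> bool:
    """True if the follower's next entry is no longer in the leader's log."""
    log_start_index = purged_upto + 1  # assumption: purge is inclusive
    return follower_next_index < log_start_index


# Numbers from the example: leader, up-to-date follower, late follower.
commit_indices = [2_000_000, 2_000_000, 1_000_000]
late_follower_next_index = 1_000_001

enabled = purge_index(1_500_000, commit_indices, True)
disabled = purge_index(1_500_000, commit_indices, False)
snapshot_when_enabled = needs_snapshot(late_follower_next_index, enabled)
snapshot_when_disabled = needs_snapshot(late_follower_next_index, disabled)
```

In this model the late follower needs a snapshot only in the enabled case, which matches the walkthrough above.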
Therefore, with purgeUpToSnapshotIndex disabled and under normal conditions,
there should not be a case where a late follower needs to install a snapshot
from the leader.
This might not be a perfect example, but hopefully it clarifies things. Please
let me know if you have any questions or counter-examples.
> Add Configuration for OM Ratis Log Purge Tuning Parameters
> ----------------------------------------------------------
>
> Key: HDDS-8131
> URL: https://issues.apache.org/jira/browse/HDDS-8131
> Project: Apache Ozone
> Issue Type: Improvement
> Components: Ozone Manager
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.3.0
>
>
> Currently Ozone Manager enables {{raft.server.log.purge.upto.snapshot.index}}
> by default.
> However, for an OM cluster with a large metadata store, there might be a case
> where the OM leader purges its Ratis logs before a slow follower has
> replicated them to its log. This means that the follower needs to download the
> whole metadata store from the OM leader. This can be problematic if the
> metadata store in the leader is too large.
> We should add two configurations in OM to enable/disable Ratis purge
> parameters:
> * {{raft.server.log.purge.upto.snapshot.index}}
> ** Disabling this would guarantee that the OM leader will not purge its
> Ratis log unless all the logs have been replicated to all the followers
> (through {{commitIndex}}).
> ** This would effectively mean that there shouldn't be a case where the
> slow follower needs to download the full metadata from the leader, so
> followers would not download snapshots. However, for small OM metadata, it
> can be faster for a follower to download the leader's metadata snapshot than
> to replicate and apply the outstanding logs normally.
> ** For a very slow follower / downed follower, the OM leader cannot purge
> the log until the follower catches up to it. This might increase the disk
> space usage for the OM leader.
> ** Default would be {{true}} to preserve the current OM snapshot behavior
> * {{raft.server.log.purge.preservation.log.num}}
> ** RATIS-1626 introduces logic to preserve the latest n logs from being purged
> ** Setting n > 0 while still enabling
> {{raft.server.log.purge.upto.snapshot.index}} should strike a balance between
> the cost of preserving & transferring logs and the cost of transferring a
> snapshot.
> ** Default would be 0 to preserve the current OM snapshot behavior