[
https://issues.apache.org/jira/browse/HDDS-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644951#comment-17644951
]
Prashant Pogde commented on HDDS-6510:
--------------------------------------
[~Nibiruxu] Yes, in HA mode all OM nodes trigger the snapshot request at
exactly the same point in time, through a Ratis transaction. We need to add
logic to handle snapshots as part of incremental checkpointing. Please also
sync up with [~georgeJahad]. He is looking at this logic for syncing lagging
followers with the leader.
> Incremental Checkpointing Support
> ---------------------------------
>
> Key: HDDS-6510
> URL: https://issues.apache.org/jira/browse/HDDS-6510
> Project: Apache Ozone
> Issue Type: New Feature
> Reporter: Xu Shao Hong
> Assignee: Xu Shao Hong
> Priority: Major
> Labels: pull-request-available
> Attachments: 2022-03-15 7.58.44.png
>
>
> Currently, installing a snapshot for OM or SCM means taking a checkpoint of
> the whole RDB and sending it to the follower. As the data stored in RDB
> grows, transferring the whole checkpoint takes a very long time. This can
> force the follower to install snapshots repeatedly: if it finds that the
> leader has already truncated the newer Raft logs, it needs to install yet
> another snapshot.
> For example, in an OM test with a Raft log index of 570767469, it took
> around 13 minutes for the follower to install the snapshot. Ozone is
> designed to overcome the limits of in-memory metadata, so it should be able
> to hold far more data than the hundred-million level. Once the OM reaches
> that level, every snapshot installation becomes a big problem: only two
> Raft peers are working in the meantime (in a 3-node HA setup), and that
> condition is fragile.
> Another statistic: for 16 hundred million (1.6 billion) keys, the om.db
> directory is 45 GB, i.e. around 2.8 GB per hundred million keys. This was
> measured through the createKey API.
> To solve the problem, we should support incremental checkpointing. It would
> send a small increment instead of the whole RDB checkpoint and thus reduce
> the transmission time. I recommend referring to the implementation in
> Flink, but we need to store the checkpoint diffs locally instead of in
> another storage system.
>
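As a rough illustration of the idea in the description above (not the Ozone or Flink implementation), the sketch below compares the file listing of the checkpoint a follower already holds with the leader's latest checkpoint and reports which files would need to be sent. It assumes that RocksDB SST files are immutable once written, so a file with the same name in both checkpoints is taken to have the same content. The function and type names are hypothetical, and the small metadata files (MANIFEST, CURRENT, OPTIONS) are assumed to be resent every time and are left out.

```python
from typing import Iterable, List, NamedTuple


class CheckpointDiff(NamedTuple):
    """Hypothetical result type: what an incremental transfer would do."""
    to_send: List[str]    # files only the leader's checkpoint has
    to_delete: List[str]  # files the follower holds that are now obsolete


def diff_checkpoints(follower_files: Iterable[str],
                     leader_files: Iterable[str]) -> CheckpointDiff:
    """Compare two checkpoint file listings by SST file name.

    Assumption: an SST file never changes after it is written, so equal
    names imply equal content and the file need not be transferred again.
    """
    follower = set(follower_files)
    leader = set(leader_files)
    return CheckpointDiff(
        to_send=sorted(leader - follower),
        to_delete=sorted(follower - leader),
    )


if __name__ == "__main__":
    # Illustrative file names only; not taken from a real om.db directory.
    previous = ["000010.sst", "000011.sst", "000012.sst"]
    latest = ["000011.sst", "000012.sst", "000015.sst"]
    diff = diff_checkpoints(previous, latest)
    print("send:", diff.to_send)
    print("delete:", diff.to_delete)
```

With a diff like this, the transfer cost depends on how much data changed since the previous checkpoint rather than on the total size of the database.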