[
https://issues.apache.org/jira/browse/HDDS-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985231#comment-17985231
]
Ivan Andika edited comment on HDDS-12247 at 6/23/25 2:02 PM:
-------------------------------------------------------------
The purpose is to ensure that, by the time PutBlock returns to the client,
the chunks have been durably written (applied) to disk on all, or at least
2 of 3, datanodes. Currently we use ALL_COMMITTED, falling back to
MAJORITY_COMMITTED when ALL_COMMITTED fails, which means the only guarantee
is that the chunks are written on the DN leader: the followers might have
committed the log entries (i.e. promised to apply them and write the chunks
at some point in the future) but not yet applied them. Additionally, because
the Raft group remove mechanism previously caused all Raft operations on the
group to fail, those committed logs might never be applied in some cases.
RATIS-2245 alleviated this by ensuring the Raft server applies all committed
log entries before the Raft group removal can complete. However, there are
still cases where this does not work (disk full, DN crash, etc.). As a
result, we sometimes see that only 1 or 2 replicas of the chunks/blocks are
created (e.g. Container 1 has blockCommitSequenceId 10 on the DN leader but
blockCommitSequenceId 8 on the other DNs). If the DN leader (which has the
most up-to-date data) goes down, a client that tries to read a chunk written
after blockCommitSequenceId 8 will get UNKNOWN_BCSID or BCSID_MISMATCH.
However, waiting for apply would increase write latency considerably, so it
might not be desirable.
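The commit-vs-apply gap described above can be sketched with a small model
(illustrative Python, not Ozone or Ratis code; the peer names and index
values are hypothetical, mirroring the blockCommitSequenceId 10 vs 8 example):

```python
# Sketch only: models why a MAJORITY_COMMITTED-style guarantee does not
# imply the data is applied (written to disk) on a majority of replicas.
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    commit_index: int   # highest Raft log entry the peer has committed
    applied_index: int  # highest entry actually applied to the state machine

def majority_committed(peers, index):
    """True if a majority of peers have committed the entry at `index`."""
    return sum(p.commit_index >= index for p in peers) * 2 > len(peers)

def majority_applied(peers, index):
    """True if a majority of peers have actually applied the entry."""
    return sum(p.applied_index >= index for p in peers) * 2 > len(peers)

# Leader applied entry 10; followers committed it but only applied up to 8.
peers = [
    Peer("leader", 10, 10),
    Peer("follower1", 10, 8),
    Peer("follower2", 10, 8),
]

print(majority_committed(peers, 10))  # True  -> the write returns to the client
print(majority_applied(peers, 10))    # False -> only the leader has the chunk on disk
```

If the leader then goes down, no remaining replica can serve the entry at
index 10, which is exactly the UNKNOWN_BCSID / BCSID_MISMATCH situation.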
was (Author: JIRAUSER298977): (previous revision of the comment; superseded
by the text above)
> WaitForApply on data writes for ratis keys
> ------------------------------------------
>
> Key: HDDS-12247
> URL: https://issues.apache.org/jira/browse/HDDS-12247
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Datanode
> Reporter: Swaminathan Balachandran
> Priority: Major
>
> In order to ensure the data is completely written to at least a majority
> of the nodes (or all the nodes), we could add a waitForApply step in which
> the Ratis leader optionally waits for the followers to apply the
> transaction, rather than only committing it to the Raft log. This would
> mean that the applyIndex would have to be shared in the Raft protocol.
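A minimal sketch of such a wait, assuming a hypothetical helper that polls
each replica's reported applied index (illustrative Python; none of these
names are real Ratis or Ozone APIs):

```python
# Sketch only: the leader blocks until a quorum of replicas report an
# appliedIndex >= the transaction's log index, instead of returning as soon
# as the entry is committed. This presumes followers share their applyIndex.
import time

def wait_for_apply(get_applied_indices, log_index, quorum,
                   timeout_s=10.0, poll_s=0.05):
    """Block until `quorum` peers report appliedIndex >= log_index."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        applied = [i for i in get_applied_indices() if i >= log_index]
        if len(applied) >= quorum:
            return True
        time.sleep(poll_s)
    return False

# Example: 3 replicas, all of which have applied up to index 10.
print(wait_for_apply(lambda: [10, 10, 10], log_index=10, quorum=2))  # True
```

Setting quorum to the full replica count gives the strict "all applied"
behaviour; a majority quorum trades some of that safety for lower latency.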
--
This message was sent by Atlassian Jira
(v8.20.10#820010)