[ https://issues.apache.org/jira/browse/HDDS-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985231#comment-17985231 ]

Ivan Andika edited comment on HDDS-12247 at 6/23/25 2:02 PM:
-------------------------------------------------------------

The current purpose is to ensure that when PutBlock returns to the client, the 
chunks have been persistently written (applied) to disk on all three, or at 
least a majority, of the datanodes. Currently, we use ALL_COMMITTED and fall 
back to MAJORITY_COMMITTED when ALL_COMMITTED fails, which means the only 
guarantee is that the chunks are written on the DN leader: the followers might 
have committed the log entries (i.e. promised to apply the commits and write 
the chunks sometime in the future) but not yet applied them. Additionally, due 
to the Raft group remove mechanism, which previously caused all Raft operations 
on the group to fail, those committed logs might never be applied in some 
cases. The issue was alleviated by RATIS-2245, which ensures the Raft server 
applies all committed logs before the Raft group removal can complete. However, 
there are still cases where this might not work (disk full, DN crash, etc.). 
Therefore, we can sometimes see that only 1 or 2 replicas of the chunks / 
blocks are created (e.g. Container 1 has blockCommitSequenceId 10 on the DN 
leader but blockCommitSequenceId 8 on the other DNs). If the DN leader (with 
the most up-to-date data) is down, a client that tries to read a chunk written 
after blockCommitSequenceId 8 will get UNKNOWN_BCSID or BCSID_MISMATCH.

However, waiting for apply will increase the write latency considerably, so it 
might not be desirable. 
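A minimal sketch of the commit-vs-apply gap described above. This is a 
hypothetical model, not the Ratis API: the Datanode class, its field names, and 
the read helper are all invented for illustration. It shows why a successful 
MAJORITY_COMMITTED watch does not imply the chunk is on disk on the followers.

```java
// Hypothetical sketch (not the Ratis API): models the gap between a
// follower's committed log index and its applied index.
final class BcsidGapDemo {

    /** Per-datanode state: committed log index vs. applied index. */
    static final class Datanode {
        final String name;
        final long committedIndex; // log entry durable, apply promised
        final long appliedIndex;   // chunk actually written to disk

        Datanode(String name, long committedIndex, long appliedIndex) {
            this.name = name;
            this.committedIndex = committedIndex;
            this.appliedIndex = appliedIndex;
        }
    }

    /** A read of a block with the given BCSID fails on a node that has not applied it. */
    static String read(Datanode dn, long bcsid) {
        return dn.appliedIndex >= bcsid ? "OK" : "UNKNOWN_BCSID";
    }

    public static void main(String[] args) {
        // Leader has applied up to index 10; followers have committed 10
        // but only applied up to 8 (the scenario from the comment above).
        Datanode leader = new Datanode("leader", 10, 10);
        Datanode follower1 = new Datanode("follower1", 10, 8);
        Datanode follower2 = new Datanode("follower2", 10, 8);

        // MAJORITY_COMMITTED is satisfied (all three committed index 10),
        // so the PutBlock for BCSID 10 returns success to the client.
        System.out.println("read from leader:    " + read(leader, 10));    // OK
        // If the leader goes down, reads for BCSID 10 hit a follower:
        System.out.println("read from follower:  " + read(follower1, 10)); // UNKNOWN_BCSID
        System.out.println("read from follower:  " + read(follower2, 10)); // UNKNOWN_BCSID
    }
}
```

A waitForApply, as proposed in this issue, would close the gap by not 
acknowledging the write until appliedIndex (not just committedIndex) has 
reached the entry on the required number of nodes.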



> WaitForApply on data writes for ratis keys
> ------------------------------------------
>
>                 Key: HDDS-12247
>                 URL: https://issues.apache.org/jira/browse/HDDS-12247
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Datanode
>            Reporter: Swaminathan Balachandran
>            Priority: Major
>
> In order to ensure the data is completely written to at least a majority of 
> the nodes (or all the nodes), a waitForApply mechanism is proposed, where the 
> Ratis leader can optionally wait for the followers to apply the transaction 
> rather than just commit it to the Raft log. This would mean that the 
> applyIndex would have to be shared in the Raft protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
