[
https://issues.apache.org/jira/browse/HDDS-14246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-14246:
-------------------------------
Description:
Currently, the datanode has an option to flush writes at chunk boundaries
(hdds.container.chunk.write.sync), which is disabled by default since it might
affect DN write throughput and latency. However, with it disabled, if the
datanode machine suddenly goes down (e.g. power failure, or the process being
reaped by the OOM killer), the block file might end up with incomplete data
even though PutBlock (the write commit) succeeded, which violates our
durability guarantee. Although PutBlock triggers
FilePerBlockStrategy#finishWriteChunks, which in turn closes the file
(RandomAccessFile#close), the buffer cache might not have been flushed yet,
because closing a file does not imply that the cached data for that file is
flushed to disk (see [https://man7.org/linux/man-pages/man2/close.2.html]).
So there is a chance that the user's key's block locations are committed, but
the blocks do not exist on the datanodes due to the aforementioned failures.
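As a rough illustration (a minimal plain-Java sketch, not the actual Ozone
write path), closing the file only releases the descriptor, while an explicit
FileChannel#force is what actually asks the kernel to push the cached data to
the device:
{code:java}
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

public class CloseVsForce {
  public static void main(String[] args) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile("/tmp/chunk.data", "rw")) {
      // Write some chunk data; at this point it may only live in the page cache.
      raf.getChannel().write(ByteBuffer.wrap("chunk bytes".getBytes()));

      // Closing (via try-with-resources) does NOT guarantee the bytes reached
      // the device; a machine crash can still lose them.
      // Durability requires an explicit force() before the commit is acked:
      raf.getChannel().force(false); // false = flush file contents only
    }
  }
}
{code}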
However, flushing on every WriteChunk might cause unnecessary overhead. We
might instead consider calling FileChannel#force on PutBlock rather than on
WriteChunk, since the data only becomes visible to users once PutBlock returns
successfully (i.e. the data is committed). That way, we can guarantee that
after the user has successfully uploaded the key, the data has been
persistently stored on the leader and at least one follower has promised to
flush the data (MAJORITY_COMMITTED).
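A hypothetical sketch of the shape this change could take (the class, field,
and method names below are illustrative only, not the actual
FilePerBlockStrategy code): keep the per-chunk write path free of fsync, and
issue a single force() when the block is committed:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class BlockFileWriter {
  private final FileChannel channel;   // channel backing the block file
  private final boolean syncOnCommit;  // proposed: flush once per PutBlock

  BlockFileWriter(FileChannel channel, boolean syncOnCommit) {
    this.channel = channel;
    this.syncOnCommit = syncOnCommit;
  }

  // Called per WriteChunk: no fsync here, keeping the hot path cheap.
  void writeChunk(ByteBuffer data, long offset) throws IOException {
    channel.write(data, offset);
  }

  // Called on PutBlock (commit): one force() makes everything written so far
  // durable before the block is acknowledged as committed to the client.
  void commitBlock() throws IOException {
    if (syncOnCommit) {
      channel.force(false); // flush file contents; full metadata sync not required
    }
    channel.close();
  }
}
{code}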
This might still affect write throughput and latency, since we have to wait
for the buffer cache to be flushed to persistent storage (SSD or HDD), but it
strengthens our data durability guarantee (which should be our priority).
Flushing the buffer cache might also reduce the memory usage of the datanode.
In the future, we should consider enabling hdds.container.chunk.write.sync by
default.
> Change fsync boundary for FilePerBlockStrategy to block level
> -------------------------------------------------------------
>
> Key: HDDS-14246
> URL: https://issues.apache.org/jira/browse/HDDS-14246
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Labels: pull-request-available
>