[ https://issues.apache.org/jira/browse/HDDS-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Andika updated HDDS-12659:
-------------------------------
    Description: 
In the future, Ozone could support a "one file per container" storage layout. 
Currently, Ozone supports FilePerBlock (current) and FilePerChunk (deprecated).

The current FilePerBlock storage layout has the following benefits:
 * No write contention when writing blocks belonging to the same container
 ** However, for Ratis pipelines, this is also guaranteed by the sequential 
nature of the Raft algorithm
 * A block file can be deleted as soon as the datanode receives the deletion 
command

However, the FilePerBlock layout does not handle a large number of small files 
well, since each block is stored as a separate file. This increases the inode 
tree size on the datanodes and causes memory issues whenever we need to visit 
all the block files (e.g. the scanner, or measuring volume size using "du"). 
So while Ozone alleviates the small files problem for metadata (by storing it 
persistently in RocksDB), we haven't fully addressed the small file issues for 
the data itself (the storage layout). We can check the number of inodes using 
the "df -i" command.

For example, we recently saw that one DN had high inode and dentry cache usage 
during heavy read load.

An alternative storage layout is one file per container. This is implemented 
in some existing distributed object stores / file systems, such as SeaweedFS's 
volumes (similar to Facebook's Haystack).

This has the benefit of reducing the number of small files on the datanode. 
One container file can contain hundreds or thousands of logical files.
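
As a rough illustration of how such a layout could work, here is a minimal 
sketch of a hypothetical append path in Java (none of these classes or methods 
are existing Ozone APIs): blocks are appended to a shared container file, and 
their (offset, length) locations are tracked in an index.

{code:java}
// Hypothetical sketch, not actual Ozone code: append blocks to one
// shared container file and track where each block landed.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ContainerFile {
  // Where a block lives inside the container file.
  record BlockLocation(long offset, int length) {}

  private final FileChannel channel;
  // blockId -> location; would be persisted in an index file or superblock.
  private final Map<Long, BlockLocation> index = new ConcurrentHashMap<>();

  ContainerFile(Path path) throws IOException {
    this.channel = FileChannel.open(path, StandardOpenOption.CREATE,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
  }

  // Synchronized so two writers cannot interleave bytes at the same
  // offset (this is the write-contention drawback discussed below).
  synchronized BlockLocation putBlock(long blockId, ByteBuffer data)
      throws IOException {
    long offset = channel.size();
    int length = data.remaining();
    long pos = offset;
    while (data.hasRemaining()) {
      pos += channel.write(data, pos);  // positional write at end of file
    }
    BlockLocation loc = new BlockLocation(offset, length);
    index.put(blockId, loc);
    return loc;
  }
}
{code}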

Additionally, we could move the container metadata into the container file 
itself instead of the container DB, to ensure O(1) disk seeks per read. 
Currently, we need to check the container DB first and then fetch the 
associated blocks, which might incur more disk seeks than necessary (depending 
on the read amplification of RocksDB, etc.).
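
Continuing the hypothetical sketch above, the read path would then need no 
RocksDB lookup at all: once the per-container index has been loaded into 
memory (from the index file or superblock) when the container is opened, a 
read is one map lookup plus one positional disk read.

{code:java}
// Continuing the ContainerFile sketch: O(1) disk seeks per read.
ByteBuffer getBlock(long blockId) throws IOException {
  BlockLocation loc = index.get(blockId);  // in-memory lookup, no DB seek
  if (loc == null) {
    throw new IOException("Block " + blockId + " not found in container");
  }
  ByteBuffer buf = ByteBuffer.allocate(loc.length());
  long pos = loc.offset();
  while (buf.hasRemaining()) {
    int n = channel.read(buf, pos);        // single contiguous read region
    if (n < 0) {
      throw new IOException("Unexpected EOF in container file");
    }
    pos += n;
  }
  buf.flip();
  return buf;
}
{code}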

However, this also comes with some drawbacks:
 * Bookkeeping is required
 ** We need to keep some metadata (e.g. to track which blocks live at which 
offsets of the container file), which can be implemented as a separate "index 
file" or in the header (superblock) of the data file
 * Deletion is not direct (see the sketch after this list)
 ** Deleting a block only marks that block as deleted
 ** A separate background task will run compaction (garbage collection), 
creating a new container file with the deleted blocks removed
 *** This can momentarily increase the datanode's space usage, since a new 
file needs to be created
 * Write contention on the same file
 ** If two clients write to the same container file at the same time, a file 
lock needs to be used to prevent race conditions
 ** This introduces write contention and may reduce write throughput.
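
To make the deletion and compaction drawbacks concrete, here is a hedged 
continuation of the same hypothetical ContainerFile sketch: deletion only 
records a tombstone, and a background compaction copies the surviving blocks 
into a fresh file (which is why space usage can momentarily grow).

{code:java}
// Continuing the ContainerFile sketch: tombstone-based deletion plus
// background compaction. (Also requires java.util.Set in addition to
// the imports shown earlier.)
private final Set<Long> deleted = ConcurrentHashMap.newKeySet();

// O(1) and no file I/O: just remember that the block is gone.
void deleteBlock(long blockId) {
  deleted.add(blockId);
}

// Background task: copy every live block into a new container file.
// Until the old file is removed, both copies exist on disk, which is
// the momentary space-usage increase noted above.
synchronized void compact(ContainerFile target) throws IOException {
  for (Map.Entry<Long, BlockLocation> entry : index.entrySet()) {
    if (!deleted.contains(entry.getKey())) {
      target.putBlock(entry.getKey(), getBlock(entry.getKey()));
    }
  }
  // Remaining steps (omitted): fsync the new file, persist its index,
  // atomically rename it over the old one, then clear the tombstones.
}
{code}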

We might also store the small files directly in RocksDB (e.g. using BlobDB: 
[https://github.com/facebook/rocksdb/wiki/BlobDB]).
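
For reference, the integrated BlobDB can be enabled from RocksDB's Java API 
with a few options. A minimal sketch follows (option names as in recent 
RocksJava releases; double-check against the wiki page above):

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class BlobDbSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options()
             .setCreateIfMissing(true)
             .setEnableBlobFiles(true)              // keep values in blob files
             .setMinBlobSize(4096)                  // values >= 4 KiB become blobs
             .setEnableBlobGarbageCollection(true); // reclaim deleted blob space
         RocksDB db = RocksDB.open(options, "/tmp/blobdb-sketch")) {
      db.put("block-1".getBytes(), new byte[8192]); // value lands in a blob file
    }
  }
}
{code}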

This is a long-term wish, filed to kickstart discussion on the feasibility of 
such a storage layout in Ozone.

> One File per Container Storage Layout
> -------------------------------------
>
>                 Key: HDDS-12659
>                 URL: https://issues.apache.org/jira/browse/HDDS-12659
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>


