[ 
https://issues.apache.org/jira/browse/HDDS-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-12659:
-------------------------------
    Description: 
In the future, Ozone could support a "one file per container" storage layout. 
Currently, Ozone supports FilePerBlock and FilePerChunk (deprecated).

The current FilePerBlock storage layout has the following benefits:
 * No write contention when writing blocks that belong to the same container
 ** However, for Ratis pipelines, this is also guaranteed by the sequential 
nature of the Raft algorithm
 * A block file can be deleted as soon as the datanode receives the deletion 
command

However, the FilePerBlock layout does not handle a lot of small files well, 
since each block is stored as a separate file. This increases the inode tree 
size on the datanodes and causes memory issues when we need to check all the 
block files (e.g. the scanner, or computing volume size using "du"). So while 
Ozone solves the small files problem for the metadata (by storing it 
persistently in RocksDB), we haven't fully resolved the small file issues for 
the data itself (the storage layout).

An alternative storage layout is one file per container. This is implemented 
in some existing distributed object stores / file systems, such as SeaweedFS's 
volumes (similar to Facebook's Haystack).

This has the benefit of reducing the number of small files on the datanode: 
one container file can contain hundreds or thousands of logical files.
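To make the idea concrete, here is a minimal sketch of such a container file, loosely modeled on the Haystack/SeaweedFS volume design: blocks are appended to a single file, and an in-memory index maps each block ID to its offset and length. All class and method names here are illustrative assumptions, not actual Ozone APIs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one append-only file holds many blocks, and an
// in-memory index maps blockId -> (offset, length) within that file.
public class ContainerFile implements AutoCloseable {

  private record BlockLocation(long offset, int length) {}

  private final FileChannel channel;
  private final Map<Long, BlockLocation> index = new HashMap<>();

  public ContainerFile(Path path) throws IOException {
    this.channel = FileChannel.open(path,
        StandardOpenOption.CREATE, StandardOpenOption.READ,
        StandardOpenOption.WRITE);
  }

  // Append the block at the end of the container file and record its offset.
  // synchronized because concurrent writers would race on the append offset
  // (the write-contention drawback discussed below).
  public synchronized void putBlock(long blockId, byte[] data)
      throws IOException {
    long offset = channel.size();
    channel.write(ByteBuffer.wrap(data), offset);
    index.put(blockId, new BlockLocation(offset, data.length));
  }

  // Random read via the in-memory index.
  public byte[] getBlock(long blockId) throws IOException {
    BlockLocation loc = index.get(blockId);
    ByteBuffer buf = ByteBuffer.allocate(loc.length());
    channel.read(buf, loc.offset());
    return buf.array();
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}
```

In a real implementation the index would also need to be persisted (e.g. in the datanode's RocksDB) so it can be rebuilt after a restart.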

However, this also comes with some drawbacks:
 * Bookkeeping is required
 ** We need to keep some metadata (e.g. to track which block is at which 
offset of the container file)
 * Deletion is not direct
 ** Deleting a block only marks that block as deleted
 ** A separate background task then runs compaction, creating a new container 
file with the deleted blocks removed
 *** This can momentarily increase the datanode's space usage, since a new 
file needs to be created
 * Write contention on the same file
 ** If two clients write to the same container file at the same time, a file 
lock is needed to prevent race conditions
 ** This introduces write contention and will reduce write throughput
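The mark-and-compact deletion scheme above can be sketched as follows, with blocks modeled as in-memory byte arrays for clarity. Deletion only records the block ID; a later compaction pass rewrites the container without the deleted blocks, which is why space usage briefly grows while both copies exist. Again, all names are illustrative assumptions, not Ozone APIs.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of indirect deletion plus background compaction
// for a one-file-per-container layout.
public class CompactableContainer {

  // Insertion-ordered map stands in for the on-disk append order.
  private Map<Long, byte[]> blocks = new LinkedHashMap<>();
  private final Set<Long> deleted = new HashSet<>();

  public void putBlock(long blockId, byte[] data) {
    blocks.put(blockId, data);
  }

  // Deletion is indirect: just mark the block; space is reclaimed later.
  public void deleteBlock(long blockId) {
    deleted.add(blockId);
  }

  // Background compaction: copy only live blocks into a fresh container
  // and swap it in. While both containers exist, space usage doubles.
  public void compact() {
    Map<Long, byte[]> fresh = new LinkedHashMap<>();
    for (Map.Entry<Long, byte[]> e : blocks.entrySet()) {
      if (!deleted.contains(e.getKey())) {
        fresh.put(e.getKey(), e.getValue());
      }
    }
    blocks = fresh;
    deleted.clear();
  }

  // Bytes belonging to non-deleted blocks.
  public long liveBytes() {
    long total = 0;
    for (Map.Entry<Long, byte[]> e : blocks.entrySet()) {
      if (!deleted.contains(e.getKey())) {
        total += e.getValue().length;
      }
    }
    return total;
  }

  // Blocks physically present, including marked-deleted ones.
  public int storedBlockCount() {
    return blocks.size();
  }
}
```

The gap between storedBlockCount() and the live data is exactly the garbage that compaction reclaims; a real datanode would trigger compaction based on that ratio rather than on a timer alone.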

We might also store the small files directly in RocksDB (e.g. using 
[https://github.com/facebook/rocksdb/wiki/BlobDB]).

This is a long-term wish, intended to kickstart discussion of the feasibility 
of this storage layout in Ozone.



> One File per Container Storage Layout
> -------------------------------------
>
>                 Key: HDDS-12659
>                 URL: https://issues.apache.org/jira/browse/HDDS-12659
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
