[ 
https://issues.apache.org/jira/browse/HDDS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-12979:
-------------------------------
    Description: 
Currently, a RATIS/THREE write pipeline consists of a single Raft group of three 
datanodes.

We found in previous write tests that the sequential nature of the Raft 
consensus algorithm is a write bottleneck (rather than the I/O of the datanode 
volumes), even for writes that are unrelated to each other (e.g. writes for 
different containers in the same pipeline, or for different blocks in the same 
container, can interfere with each other). When we increased the number of 
pipelines per datanode (ozone.scm.datanode.pipeline.limit), we saw a 
significant increase in overall write throughput.
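
For reference, that is the SCM-side knob we tuned. A minimal sketch of setting it programmatically (in practice it is usually configured in ozone-site.xml; the value 4 here is an arbitrary example, not a recommendation):

{code:java}
// Sketch only: raising the per-datanode pipeline limit.
// ozone.scm.datanode.pipeline.limit is the real property key;
// the value 4 is just an example.
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class PipelineLimitExample {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    conf.setInt("ozone.scm.datanode.pipeline.limit", 4);
    System.out.println(conf.get("ozone.scm.datanode.pipeline.limit"));
  }
}
{code}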

We can consider increasing the granularity of the Ratis write pipeline. For 
example, one write pipeline could consist of multiple Raft groups backed by 
mutually exclusive datanode volumes. This way, writes can be parallelized 
across volumes and the overall throughput can be increased. Additionally, we 
can ensure volume isolation: once a volume is chosen for an active write 
pipeline, it will not be chosen for another write pipeline. Therefore, writes 
in one pipeline will not interfere with writes in another, although disk space 
pressure from things like container replication might still occur.
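
A minimal sketch of the volume-isolation idea, assuming a hypothetical allocator that hands each new pipeline a volume not already claimed by an active pipeline (VolumeAllocator and the identifiers below are made up for illustration, not existing Ozone classes):

{code:java}
// Hypothetical sketch of volume isolation: a volume backs at most one
// active write pipeline at a time.
import java.util.*;

final class VolumeAllocator {
  // volume -> pipeline currently using it; absent means the volume is free
  private final Map<String, String> volumeToPipeline = new HashMap<>();

  /** Claim a free volume for the given pipeline, or return empty if none is free. */
  synchronized Optional<String> claimVolume(Collection<String> volumes, String pipelineId) {
    for (String volume : volumes) {
      if (!volumeToPipeline.containsKey(volume)) {
        volumeToPipeline.put(volume, pipelineId);
        return Optional.of(volume);
      }
    }
    return Optional.empty();
  }

  /** Release the volume when its pipeline is closed. */
  synchronized void releaseVolume(String volume) {
    volumeToPipeline.remove(volume);
  }
}
{code}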

We could go even more granular, making each open container its own Raft group, 
so that closing a container also closes the corresponding Raft group.
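
At that granularity each open container would need its own deterministic Raft group id. A rough sketch using Ratis' RaftGroupId API (the containerId-to-UUID mapping shown is only an assumption for illustration, not how Ozone derives group ids today):

{code:java}
// Sketch: deriving a per-container Raft group id.
import java.util.UUID;
import org.apache.ratis.protocol.RaftGroupId;

final class ContainerGroupIds {
  static RaftGroupId groupIdForContainer(long containerId) {
    // Fold the container id into a UUID so the same container always maps
    // to the same Raft group id on every datanode (illustrative assumption).
    return RaftGroupId.valueOf(new UUID(0L, containerId));
  }
}
{code}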

To decide the correct level of granularity, we need to determine in which cases 
concurrent writes are acceptable (i.e. the consistency guarantees of writes). 
From what I see, since we have one file per block, technically the only thing 
that needs to be serialized is the ordering of WriteChunk requests relative to 
the PutBlock request (which acts as the write commit).
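
A minimal sketch of that ordering constraint: per block, all WriteChunk operations must be applied before the PutBlock that commits them, while different blocks remain free to proceed in parallel (the class below is hypothetical, for illustration only):

{code:java}
// Hypothetical per-block ordering: chunks of the same block are tracked and
// PutBlock (the commit) waits for them; different blocks do not serialize
// against each other.
import java.util.Map;
import java.util.concurrent.*;

final class BlockCommitTracker {
  private final Map<Long, CompletableFuture<Void>> pendingChunks = new ConcurrentHashMap<>();

  /** Record a WriteChunk for a block; chunks of different blocks run in parallel. */
  CompletableFuture<Void> writeChunk(long blockId, Runnable writeIo, Executor executor) {
    CompletableFuture<Void> chunk = CompletableFuture.runAsync(writeIo, executor);
    pendingChunks.merge(blockId, chunk, (prev, next) -> CompletableFuture.allOf(prev, next));
    return chunk;
  }

  /** PutBlock acts as the commit: it runs only after every chunk of this block is durable. */
  CompletableFuture<Void> putBlock(long blockId, Runnable commitIo, Executor executor) {
    CompletableFuture<Void> chunks =
        pendingChunks.getOrDefault(blockId, CompletableFuture.completedFuture(null));
    return chunks.thenRunAsync(commitIo, executor);
  }
}
{code}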

There are definitely overheads to increasing the number of Raft groups, such as 
managing the lifecycle of each group and the resources associated with it (e.g. 
direct buffer allocation for "raft.server.log.write.buffer.size"). Also, we 
might need to increase the number of pipelines tracked by SCM. Additionally, 
placing the Ratis log on the data volumes (which usually use spinning disks) 
might slow down single-pipeline performance in exchange for increased 
availability (we might need to increase write buffering in Ratis to avoid 
unnecessary disk seeks). We need to test whether the performance improvement is 
worth the overhead and the overall complexity.
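
To make the buffer overhead concrete: raft.server.log.write.buffer.size is allocated per Raft group, so running N groups roughly means N such direct buffers. A small sketch using the Ratis config accessors for that key (the group count and buffer size are arbitrary example numbers):

{code:java}
// Sketch: rough direct-buffer footprint of running many Raft groups.
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;
import org.apache.ratis.util.SizeInBytes;

public class RatisBufferOverhead {
  public static void main(String[] args) {
    RaftProperties props = new RaftProperties();
    // raft.server.log.write.buffer.size; 4MB is an example value
    RaftServerConfigKeys.Log.setWriteBufferSize(props, SizeInBytes.valueOf("4MB"));

    long perGroup = RaftServerConfigKeys.Log.writeBufferSize(props).getSize();
    int groups = 64; // e.g. one group per volume or per open container
    System.out.printf("~%d MB of log write buffers for %d Raft groups%n",
        groups * perGroup / (1024 * 1024), groups);
  }
}
{code}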

A related GitHub discussion is 
[https://github.com/apache/ozone/discussions/7505], which also highlights that 
since the Ratis-related metadata is stored on only a single volume, if that 
volume goes down, the datanode cannot serve writes and is marked as unhealthy. 
Making writes more granular can increase the write availability of a datanode 
in case of such a failure.

Another idea is to throttle writes based on the theoretical bandwidth of each 
disk/SSD volume.
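
A rough sketch of that throttling idea, assuming a simple per-volume limiter keyed on the volume's theoretical bandwidth (everything here is hypothetical; Guava's RateLimiter is used only as an example mechanism):

{code:java}
// Hypothetical per-volume write throttle based on theoretical bandwidth.
import com.google.common.util.concurrent.RateLimiter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class VolumeWriteThrottler {
  private final Map<String, RateLimiter> limiters = new ConcurrentHashMap<>();

  /** Register a volume with its theoretical bandwidth, e.g. ~150 MB/s for a spinning disk. */
  void registerVolume(String volume, long bytesPerSecond) {
    limiters.put(volume, RateLimiter.create(bytesPerSecond));
  }

  /** Block until writing 'bytes' to 'volume' fits within the configured bandwidth. */
  void throttle(String volume, int bytes) {
    RateLimiter limiter = limiters.get(volume);
    if (limiter != null) {
      limiter.acquire(bytes); // one permit per byte
    }
  }
}
{code}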



> Increasing Ratis Write Pipeline Granularity
> -------------------------------------------
>
>                 Key: HDDS-12979
>                 URL: https://issues.apache.org/jira/browse/HDDS-12979
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>



