[
https://issues.apache.org/jira/browse/HDDS-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947413#comment-17947413
]
Ethan Rose commented on HDDS-12578:
-----------------------------------
After 12k Jiras this one probably has my favorite title : )
It helps to look at what properties we need out of our write pipeline:
* In the context of rolling upgrades, the easiest implementation to support is
stateless pipelines, where the pipeline is a group of nodes chosen on the fly
by the management layer (SCM for Ozone, NameNode for HDFS) for each write
operation. This way there is no extra state management (like closing pipelines)
required when a node is shut down for upgrade, and no ripple effect through the
cluster where peers need to behave differently (close pipeline, potentially be
added to new pipelines to compensate) just because a different node was shut
down.
* We don't need dynamic pipeline reconfiguration by the pipeline members
themselves. We already have a consensus layer above our datanodes (SCM) that
can easily handle "reconfiguration" by simply handing out a new set of nodes
when it allocates a new block because the last one failed to write.
* Write all/read any semantics for durability and performance. Slow nodes in a
write pipeline can be identified through reporting and then either excluded,
given reduced load, or flagged to the admin.
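To make the first two points concrete, here's a toy sketch of stateless, on-the-fly allocation with an exclude list (the names `ScmStub` and `allocate_block` are hypothetical illustrations, not Ozone APIs):

```python
import random

class ScmStub:
    """Toy stand-in for the management layer (SCM for Ozone, NameNode
    for HDFS): picks a fresh set of nodes per block, keeping no
    pipeline state on the datanodes themselves."""

    def __init__(self, datanodes):
        self.datanodes = set(datanodes)

    def allocate_block(self, replication=3, exclude=()):
        # Nodes are chosen on the fly for this write only, so nothing
        # needs to be closed or reconfigured when a node later goes
        # down for an upgrade.
        candidates = sorted(self.datanodes - set(exclude))
        if len(candidates) < replication:
            raise RuntimeError("not enough healthy datanodes")
        return random.sample(candidates, replication)

scm = ScmStub(["dn1", "dn2", "dn3", "dn4", "dn5"])
pipeline = scm.allocate_block()
# Write all: the client acks only after every node in the pipeline has
# the data, so any single replica can later serve a read (read any).
# If a node fails mid-write, the client just asks again with an
# exclude list instead of reconfiguring the existing pipeline.
retry = scm.allocate_block(exclude=pipeline[:1])
```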
Then we can look at implementations that meet these requirements:
* *HDFS chained replication* meets all of these requirements:
** Stateless: nodes to service a request are chosen on the fly.
** No reconfiguration logic required: Clients request a new pipeline with an
exclude list of nodes when a write fails, and they can pick up where they left
off.
*** IIRC there is some fine print here, where HDFS may try to reconfigure the
pipeline to avoid making the client rewrite the block. Ozone does not have this
problem since we can partially fill a block and pick up where we left off in a
new block if needed.
** Write all/read any semantics
* *Ozone Ratis pipelines* do not meet these requirements very well:
** Stateful: tracked on datanodes and in SCM.
** Partially reconfigurable (leader/follower) but does not support membership
changes.
** Supports partial writes with majority commit semantics, which causes
complications when containers are closed and removed from the pipeline.
* *CRAQ* (as described in these articles) adds extra complexity that does not
benefit our specific architecture:
** Stateful: Nodes are burdened with tracking membership and reconfiguration
to keep the pipeline alive indefinitely.
** Fully reconfigurable: Adds complexity that our system does not need,
because we have SCM above the nodes to manage membership.
*** Note that Delta is using hash based placement, so they require the above
two points to keep their pipelines alive forever. We do not have this
requirement.
** Write all/read any semantics
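The client-driven recovery that makes chained replication attractive for us can be sketched like this (all names hypothetical; the key point is that Ozone can partially fill a block and pick up in a new one, so no in-place pipeline reconfiguration is needed):

```python
def write_chain(pipeline, data, fail_on=None):
    """Toy chained write: data flows head -> tail; the write succeeds
    only if every node in the chain stores it (write all)."""
    stored = []
    for node in pipeline:
        if node == fail_on:
            return stored, node  # partial write and the failed node
        stored.append(node)
    return stored, None

def write_with_retry(scm_allocate, data, fail_on=None):
    """Client-driven recovery: on failure, request a brand-new set of
    nodes with an exclude list and continue in a new block, instead of
    reconfiguring the existing pipeline in place."""
    exclude = []
    while True:
        pipeline = scm_allocate(exclude)
        stored, failed = write_chain(pipeline, data, fail_on)
        if failed is None:
            return pipeline
        exclude.append(failed)
        fail_on = None  # assume the retry succeeds in this sketch

nodes = ["dn1", "dn2", "dn3", "dn4"]

def allocate(exclude):
    # Hypothetical SCM stand-in: fresh nodes on the fly, honoring excludes.
    return [n for n in nodes if n not in exclude][:3]

# First attempt loses dn2 mid-chain; the retry gets fresh nodes.
final_pipeline = write_with_retry(allocate, b"block-data", fail_on="dn2")
```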
While the implementations described here have merits in different
architectures, I don't see them benefiting Ozone specifically. Simple
stateless chained replication still seems like the best way forward for
replicated data.
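For contrast, here is roughly what the per-key clean/dirty version tracking at every CRAQ node looks like (a toy sketch of the idea, not 3FS or Delta code), which illustrates the statefulness burden placed on each chain member:

```python
class CraqNode:
    """Toy CRAQ node: every node tracks versions plus a clean/dirty
    flag per key, extra state that chain members must maintain to
    keep the pipeline alive and serve reads from any node."""

    def __init__(self, is_tail=False):
        self.is_tail = is_tail
        self.versions = {}  # key -> list of (version, value)
        self.clean = {}     # key -> latest committed (clean) version

    def apply_write(self, key, version, value):
        self.versions.setdefault(key, []).append((version, value))
        if self.is_tail:
            # The tail commits immediately; in the real protocol the
            # commit ack then flows back up the chain.
            self.clean[key] = version

    def commit(self, key, version):
        self.clean[key] = version

    def read(self, key, tail):
        latest_version, value = self.versions[key][-1]
        if latest_version == self.clean.get(key):
            return value  # clean: serve locally, no tail round trip
        # Dirty: ask the tail which version is committed, serve that.
        clean_version = tail.clean[key]
        for version, val in self.versions[key]:
            if version == clean_version:
                return val

head, mid, tail = CraqNode(), CraqNode(), CraqNode(is_tail=True)
for node in (head, mid, tail):
    node.apply_write("k", 1, "v1")
mid.commit("k", 1)
head.commit("k", 1)
# A second write is still in flight: the tail has not seen v2 yet,
# so a read at the head is dirty and falls back to the clean v1.
head.apply_write("k", 2, "v2")
```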
> Ozone on CRAQ
> -------------
>
> Key: HDDS-12578
> URL: https://issues.apache.org/jira/browse/HDDS-12578
> Project: Apache Ozone
> Issue Type: Wish
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> This is just a long-term wish to explore Chain Replication or CRAQ on Ozone.
> Currently Ozone supports Raft based write pipeline and EC. From the Data
> replication spectrum
> ([https://transactional.blog/blog/2024-data-replication-design-spectrum]),
> these two pipelines cover the Leader-based (Raft based write pipeline) and
> Quorum-based (EC) replication algorithms. CRAQ falls under
> Reconfiguration-based replication algorithms.
> We can consider supporting CRAQ pipelines on Ozone. As mentioned in
> discussion
> [https://github.com/apache/ozone/discussions/6870#discussioncomment-9907706],
> chained replication might be needed for rolling upgrade support. Although
> CRAQ promises higher bandwidth, higher read performance, and strong
> consistency, there are some drawbacks such as higher write latency (since all
> writes need to propagate to the tail), higher downtime during node failure
> (waiting for the control plane to reconfigure the chains), etc.
> The wish comes from the recent DeepSeek 3FS distributed file system that uses
> CRAQ as its main write pipeline
> ([https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md]). Other
> systems such as Meta's Delta
> ([https://engineering.fb.com/2022/05/04/data-infrastructure/delta/]) also
> use CRAQ.
> Since it is a Reconfiguration-based replication algorithm, there might be a
> need to support ZooKeeper-like semantics on top of Ratis or Raft in SCM HA,
> similar to Clickhouse Keeper ([https://clickhouse.com/clickhouse/keeper]) or
> Meta's Zelos ([https://engineering.fb.com/2022/06/08/developer-tools/zelos/]).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)