[
https://issues.apache.org/jira/browse/HDDS-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947413#comment-17947413
]
Ethan Rose commented on HDDS-12578:
-----------------------------------
After 12k Jiras this one probably has my favorite title : )
It helps to look at what properties we need out of our write pipeline:
* In the context of rolling upgrades, the easiest implementation to support is
stateless pipelines, where the pipeline is a group of nodes chosen on the fly
by the management layer (SCM for Ozone, NameNode for HDFS) for each write
operation. This way there is no extra state management (like closing pipelines)
required when a node is shut down for upgrade, and no ripple effect through the
cluster where peers need to behave differently (close pipeline, potentially be
added to new pipelines to compensate) just because a different node was shut
down.
* We don't need dynamic pipeline reconfiguration by the pipeline members
themselves. We already have a consensus layer above our datanodes (SCM) that
can easily handle "reconfiguration" by simply handing out a new set of nodes
when it allocates a new block because the last one failed to write.
* Write all/read any semantics for durability and performance. Slow nodes in a
write pipeline can be identified through reporting and then either excluded,
given reduced load, or flagged to the admin.
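To make the first two points concrete, here's a toy sketch of stateless, on-the-fly allocation with an exclude list (the names `ScmStub` and `allocate_block` are hypothetical illustrations, not Ozone APIs):

```python
import random

class ScmStub:
    """Toy stand-in for the management layer (SCM for Ozone, NameNode
    for HDFS): picks a fresh set of nodes per block, keeping no
    pipeline state on the datanodes themselves."""

    def __init__(self, datanodes):
        self.datanodes = set(datanodes)

    def allocate_block(self, replication=3, exclude=()):
        # Nodes are chosen on the fly for this write only, so nothing
        # needs to be closed or reconfigured when a node later goes
        # down for an upgrade.
        candidates = sorted(self.datanodes - set(exclude))
        if len(candidates) < replication:
            raise RuntimeError("not enough healthy datanodes")
        return random.sample(candidates, replication)

scm = ScmStub(["dn1", "dn2", "dn3", "dn4", "dn5"])
pipeline = scm.allocate_block()
# Write all: the client acks only after every node in the pipeline has
# the data, so any single replica can later serve a read (read any).
# If a node fails mid-write, the client just asks again with an
# exclude list instead of reconfiguring the existing pipeline.
retry = scm.allocate_block(exclude=pipeline[:1])
```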
Then we can look at implementations that meet these requirements:
* *HDFS chained replication* meets all of these requirements:
** Stateless: nodes to service a request are chosen on the fly.
** No reconfiguration logic required: Clients request a new pipeline with an
exclude list of nodes when a write fails, and they can pick up where they left
off.
*** IIRC there is some fine print here, where HDFS may try to reconfigure the
pipeline to avoid making the client rewrite the block. Ozone does not have this
problem since we can partially fill a block and pick up where we left off in a
new block if needed.
** Write all/read any semantics
* *Ozone Ratis pipelines* do not meet these requirements very well:
** Stateful: tracked on datanodes and in SCM.
** Partially reconfigurable (leader/follower) but does not support membership
changes.
** Supports partial writes with majority commit semantics, which causes
complications when containers are closed and removed from the pipeline.
* *CRAQ* (as described in these articles) adds extra complexity that does not
benefit our specific architecture:
** Stateful: Nodes are burdened with tracking membership and reconfiguration
to keep the pipeline alive indefinitely.
** Fully reconfigurable: Adds complexity that our system does not need,
because we have SCM above the nodes to manage membership.
*** Note that Delta is using hash based placement, so they require the above
two points to keep their pipelines alive forever. We do not have this
requirement.
** Write all/read any semantics
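The client-driven recovery that makes chained replication attractive for us can be sketched like this (all names hypothetical; the key point is that Ozone can partially fill a block and pick up in a new one, so no in-place pipeline reconfiguration is needed):

```python
def write_chain(pipeline, data, fail_on=None):
    """Toy chained write: data flows head -> tail; the write succeeds
    only if every node in the chain stores it (write all)."""
    stored = []
    for node in pipeline:
        if node == fail_on:
            return stored, node  # partial write and the failed node
        stored.append(node)
    return stored, None

def write_with_retry(scm_allocate, data, fail_on=None):
    """Client-driven recovery: on failure, request a brand-new set of
    nodes with an exclude list and continue in a new block, instead of
    reconfiguring the existing pipeline in place."""
    exclude = []
    while True:
        pipeline = scm_allocate(exclude)
        stored, failed = write_chain(pipeline, data, fail_on)
        if failed is None:
            return pipeline
        exclude.append(failed)
        fail_on = None  # assume the retry succeeds in this sketch

nodes = ["dn1", "dn2", "dn3", "dn4"]

def allocate(exclude):
    # Hypothetical SCM stand-in: fresh nodes on the fly, honoring excludes.
    return [n for n in nodes if n not in exclude][:3]

# First attempt loses dn2 mid-chain; the retry gets fresh nodes.
final_pipeline = write_with_retry(allocate, b"block-data", fail_on="dn2")
```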
While the implementations described here have merits in different
architectures, I don't see them benefiting Ozone specifically. Simple
stateless chained replication still seems like the best way forward for
replicated data.
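For contrast, here is roughly what the per-key clean/dirty version tracking at every CRAQ node looks like (a toy sketch of the idea, not 3FS or Delta code), which illustrates the statefulness burden placed on each chain member:

```python
class CraqNode:
    """Toy CRAQ node: every node tracks versions plus a clean/dirty
    flag per key, extra state that chain members must maintain to
    keep the pipeline alive and serve reads from any node."""

    def __init__(self, is_tail=False):
        self.is_tail = is_tail
        self.versions = {}  # key -> list of (version, value)
        self.clean = {}     # key -> latest committed (clean) version

    def apply_write(self, key, version, value):
        self.versions.setdefault(key, []).append((version, value))
        if self.is_tail:
            # The tail commits immediately; in the real protocol the
            # commit ack then flows back up the chain.
            self.clean[key] = version

    def commit(self, key, version):
        self.clean[key] = version

    def read(self, key, tail):
        latest_version, value = self.versions[key][-1]
        if latest_version == self.clean.get(key):
            return value  # clean: serve locally, no tail round trip
        # Dirty: ask the tail which version is committed, serve that.
        clean_version = tail.clean[key]
        for version, val in self.versions[key]:
            if version == clean_version:
                return val

head, mid, tail = CraqNode(), CraqNode(), CraqNode(is_tail=True)
for node in (head, mid, tail):
    node.apply_write("k", 1, "v1")
mid.commit("k", 1)
head.commit("k", 1)
# A second write is still in flight: the tail has not seen v2 yet,
# so a read at the head is dirty and falls back to the clean v1.
head.apply_write("k", 2, "v2")
```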
> Ozone on CRAQ
> -------------
>
> Key: HDDS-12578
> URL: https://issues.apache.org/jira/browse/HDDS-12578
> Project: Apache Ozone
> Issue Type: Wish
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> This is just a long-term wish to explore Chain Replication or CRAQ on Ozone.
> Currently Ozone supports Raft based write pipeline and EC. From the Data
> replication spectrum
> ([https://transactional.blog/blog/2024-data-replication-design-spectrum]),
> these two pipelines cover the Leader-based (Raft based write pipeline) and
> Quorum-based (EC) replication algorithms. CRAQ falls under
> Reconfiguration-based replication algorithms.
> We can consider supporting CRAQ pipelines on Ozone. As mentioned in
> discussion
> [https://github.com/apache/ozone/discussions/6870#discussioncomment-9907706],
> chained replication might be needed for rolling upgrade support. Although
> CRAQ promises higher bandwidth, higher read performance, and strong
> consistency, there are some drawbacks such as higher write latency (since all
> writes need to propagate to the tail), higher downtime during node failure
> (waiting for the control plane to reconfigure the chains), etc.
> The wish comes from the recent DeepSeek 3FS distributed file system that uses
> CRAQ as its main write pipeline
> ([https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md]). Other
> systems such as Meta's Delta
> ([https://engineering.fb.com/2022/05/04/data-infrastructure/delta/]) also
> use CRAQ.
> Since it is a Reconfiguration-based replication algorithm, there might be a
> need to support ZooKeeper-like semantics on top of Ratis or Raft in SCM HA,
> similar to Clickhouse Keeper ([https://clickhouse.com/clickhouse/keeper]) or
> Meta's Zelos ([https://engineering.fb.com/2022/06/08/developer-tools/zelos/]).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)