[ 
https://issues.apache.org/jira/browse/HDDS-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088106#comment-18088106
 ] 

Tsz-wo Sze edited comment on HDDS-12578 at 6/10/26 10:07 PM:
-------------------------------------------------------------

 - CRAQ: if the data is dirty, ask the last node for the latest version. (The 
idea is similar to the Read-Index algorithm.)
 - HDFS: always return up to the visible length, which is the acked length.

!screenshot-1.png|width=700!

Consider the HDFS write pipeline above.  The p's are packets and the a's are 
acks; the numbers are packet numbers.  The acked packets are therefore 5, 7 and 
9 for DN0, DN1 and DN2, respectively.  If another client reads from DN0 (or 
DN1), it can read the packets up to p5 (or p7).  The client can fail over to 
any other datanode in the pipeline, since every datanode must have the data 
within any acked length.
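
The visible-length rule can be sketched as a toy model. This is only an illustration, not HDFS code: the `Datanode` class, its fields and the `read` method are hypothetical, and `received=9` on every node is an assumption (the tail has acked p9, so the upstream nodes must already hold it).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Datanode:
    """Hypothetical model of one datanode in the write pipeline."""
    name: str
    received: int  # highest packet stored locally (assumed value)
    acked: int     # highest packet whose ack has reached this node

    def visible_length(self) -> int:
        # HDFS-style rule: expose only data whose ack this node has seen.
        return self.acked

    def read(self, known_length: Optional[int] = None) -> int:
        # A fresh reader gets this node's visible length.
        if known_length is None:
            return self.visible_length()
        # A reader failing over carries a length it already observed.
        # Serving it is safe because acked data is on every datanode.
        if known_length > self.received:
            raise ValueError("length was never acked in this pipeline")
        return known_length


# Acked packets 5, 7 and 9 as in the figure.
dn0 = Datanode("DN0", received=9, acked=5)
dn1 = Datanode("DN1", received=9, acked=7)
dn2 = Datanode("DN2", received=9, acked=9)

length_seen = dn1.read()                # fresh read from DN1
after_failover = dn0.read(length_seen)  # same client fails over to DN0
print(length_seen, after_failover)
```

In this model the client that already saw p7 on DN1 still gets p7 after failing over to DN0, even though a fresh reader on DN0 would only see p5.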

A potential problem in HDFS:
 - Step 1: client A reads from DN1 and gets the data up to p7.
 - Step 2: client A tells client B to read the data.
 - Step 3: client B reads from DN0. It can only get packets up to p5, not up 
to p7.

Client A and client B are synchronized externally (Step 2).  This is an 
unsupported use case in HDFS.
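
The three steps can be replayed with the numbers from the figure. The dictionary and the `fresh_read` helper are hypothetical names used only for this sketch.

```python
# Visible (acked) packet number per datanode, taken from the figure.
visible = {"DN0": 5, "DN1": 7, "DN2": 9}


def fresh_read(datanode: str) -> int:
    """A reader with no prior state sees only that datanode's visible length."""
    return visible[datanode]


seen_by_a = fresh_read("DN1")  # Step 1: client A reads up to p7
# Step 2: client A notifies client B through a channel outside HDFS.
seen_by_b = fresh_read("DN0")  # Step 3: client B reads up to p5

# Client B observes less data than client A did, although it read later.
stale_read = seen_by_b < seen_by_a
print(seen_by_a, seen_by_b, stale_read)
```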

If we don't support such a use case in Ozone, we can use the visible length as 
in HDFS. If we really want to support it, we may change the intermediate 
datanodes to get the ack number from the last datanode (i.e. CRAQ).
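
A rough sketch of the CRAQ-style alternative, under the same assumptions as the figure: a node that holds data beyond its own acked length is "dirty" and asks the last datanode for the ack number. The `ChainNode` class and its method names are hypothetical, and the query to the tail is modelled as a direct attribute read rather than an RPC.

```python
class ChainNode:
    """Hypothetical chain node; tail=None marks the last datanode."""

    def __init__(self, name, received, acked, tail=None):
        self.name = name
        self.received = received  # highest packet stored locally (assumed)
        self.acked = acked        # highest packet whose ack reached this node
        self.tail = tail

    def readable_length(self):
        # Dirty means some local packets have not been acked to this node yet.
        dirty = self.received > self.acked
        if self.tail is None or not dirty:
            return self.acked
        # Dirty intermediate node: use the last datanode's ack number.
        return self.tail.acked


dn2 = ChainNode("DN2", received=9, acked=9)
dn1 = ChainNode("DN1", received=9, acked=7, tail=dn2)
dn0 = ChainNode("DN0", received=9, acked=5, tail=dn2)

seen_by_a = dn1.readable_length()  # client A reads from DN1
seen_by_b = dn0.readable_length()  # client B reads from DN0 afterwards
print(seen_by_a, seen_by_b)
```

With this rule both clients see the same length, so the externally synchronized read in Step 3 is no longer stale. The cost is an extra round trip to the last datanode whenever an intermediate node is dirty.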

 

 



> Ozone on CRAQ
> -------------
>
>                 Key: HDDS-12578
>                 URL: https://issues.apache.org/jira/browse/HDDS-12578
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> This is just a long-term wish to explore Chain Replication or CRAQ on Ozone.
> Currently, Ozone supports a Raft-based write pipeline and EC. On the data 
> replication spectrum 
> ([https://transactional.blog/blog/2024-data-replication-design-spectrum]), 
> these two pipelines cover the Leader-based (Raft-based write pipeline) and 
> Quorum-based (EC) replication algorithms. CRAQ falls under 
> Reconfiguration-based replication algorithms. 
> We can consider supporting CRAQ pipelines on Ozone. As mentioned in the 
> discussion 
> [https://github.com/apache/ozone/discussions/6870#discussioncomment-9907706], 
> chain replication might be needed for rolling upgrade support. Although 
> CRAQ promises higher bandwidth, higher read performance, and strong 
> consistency, it has some drawbacks, such as higher write latency (since all 
> writes need to propagate to the tail), longer downtime during node failure 
> (waiting for the control plane to reconfigure the chains), etc.
> The wish comes from the recent DeepSeek 3FS distributed file system, which 
> uses CRAQ as its main write pipeline 
> ([https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md]). Other 
> systems such as Meta's Delta 
> ([https://engineering.fb.com/2022/05/04/data-infrastructure/delta/]) also 
> use CRAQ.
> Since it is a Reconfiguration-based replication algorithm, there might be a 
> need to support ZooKeeper-like semantics on top of Ratis or Raft in SCM HA, 
> similar to ClickHouse Keeper ([https://clickhouse.com/clickhouse/keeper]) or 
> Meta's Zelos ([https://engineering.fb.com/2022/06/08/developer-tools/zelos/]).


