[ 
https://issues.apache.org/jira/browse/HDDS-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088166#comment-18088166
 ] 

Ivan Andika edited comment on HDDS-12578 at 6/11/26 6:16 AM:
-------------------------------------------------------------

[~szetszwo] Thanks for the evaluation.

> CRAQ: if the data is dirty, ask the last node for the latest version. (The 
> idea is similar to the Read-Index algorithm.)

From my understanding, the CRAQ step of asking the last node might not be 
needed for Ozone, for these reasons:
 - A CRAQ read is "V ← read(objID)" and does not carry a version number. An 
Ozone read carries the BCSID as a "version number", so we don't need to read 
the latest data.
 - An Ozone block is unique (i.e. the local ID is uniquely assigned by the SCM 
distributed sequence ID generator) and append-only, so we can assume that a 
read request for a block I at version 3 also sees the data appended at 
versions 2 and 1.
 -- In CRAQ, each object can be replaced by a new object with a higher 
version. This might be needed when implementing a distributed object store 
without a metadata server (e.g. Ceph OSD).

So I think we can use visible length or BCSID for Ozone.
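
To make the argument above concrete, here is a minimal sketch of a replica that answers a read pinned to a BCSID without contacting the tail. It is a toy model, not Ozone code: the class {{BlockReplica}} and its methods are hypothetical names, and BCSIDs are modelled as consecutive integers starting at 1, which is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class BlockReplica:
    """Toy model of one replica of an append-only block (hypothetical names)."""

    # committed_lengths[b] is the visible length after commit b; index 0 is
    # the empty block. Assumption: BCSIDs are consecutive integers.
    committed_lengths: list = field(default_factory=lambda: [0])

    @property
    def bcsid(self) -> int:
        # Highest commit sequence ID this replica has applied.
        return len(self.committed_lengths) - 1

    def apply_commit(self, new_length: int) -> None:
        # Append-only: the visible length never shrinks.
        if new_length < self.committed_lengths[-1]:
            raise ValueError("append-only block cannot shrink")
        self.committed_lengths.append(new_length)

    def can_serve(self, requested_bcsid: int) -> bool:
        # The replica may lag behind other replicas, yet it can answer as
        # long as it has applied at least the commit the reader asks for.
        return 0 <= requested_bcsid <= self.bcsid

    def visible_length(self, requested_bcsid: int) -> int:
        if not self.can_serve(requested_bcsid):
            raise LookupError("replica is behind the requested BCSID")
        return self.committed_lengths[requested_bcsid]


replica = BlockReplica()
replica.apply_commit(4096)   # BCSID 1
replica.apply_commit(8192)   # BCSID 2
replica.apply_commit(12288)  # BCSID 3

# A reader holding BCSID 2 is served locally even though BCSID 3 exists.
print(replica.can_serve(2), replica.visible_length(2))  # True 8192
# A reader holding BCSID 4 has to wait or try another replica.
print(replica.can_serve(4))  # False
```

Because the block is append-only, the length visible at a higher BCSID always covers the data of every lower one, which is why the reader's BCSID (or visible length) is enough to decide whether a replica can serve the read.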

It's been a while since I last looked at the CRAQ logic, so I might be missing 
something.



> Ozone on CRAQ
> -------------
>
>                 Key: HDDS-12578
>                 URL: https://issues.apache.org/jira/browse/HDDS-12578
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> This is just a long-term wish to explore Chain Replication or CRAQ on Ozone.
> Currently Ozone supports a Raft-based write pipeline and EC. On the data 
> replication design spectrum 
> ([https://transactional.blog/blog/2024-data-replication-design-spectrum]), 
> these two pipelines cover the Leader-based (Raft-based write pipeline) and 
> Quorum-based (EC) replication algorithms. CRAQ falls under 
> Reconfiguration-based replication algorithms. 
> We can consider supporting CRAQ pipelines on Ozone. As mentioned in 
> discussion 
> [https://github.com/apache/ozone/discussions/6870#discussioncomment-9907706], 
> chain replication might be needed for rolling upgrade support. Although 
> CRAQ promises higher bandwidth, higher read performance, and strong 
> consistency, it has some drawbacks, such as higher write latency (since all 
> writes need to propagate to the tail) and longer downtime during node 
> failure (waiting for the control plane to reconfigure the chains).
> The wish comes from the recent DeepSeek 3FS distributed file system, which 
> uses CRAQ as its main write pipeline 
> ([https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md]). Other 
> systems such as Meta's Delta 
> ([https://engineering.fb.com/2022/05/04/data-infrastructure/delta/]) also 
> use CRAQ.
> Since it is a Reconfiguration-based replication algorithm, there might be a 
> need to support ZooKeeper-like semantics on top of Ratis or Raft in SCM HA, 
> similar to ClickHouse Keeper ([https://clickhouse.com/clickhouse/keeper]) or 
> Meta's Zelos ([https://engineering.fb.com/2022/06/08/developer-tools/zelos/]).
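
For readers unfamiliar with CRAQ, the write and read paths mentioned in the description can be sketched as a toy single-process model. The class {{CraqChain}} and its methods are hypothetical names chosen for illustration; real CRAQ also handles acknowledgement messages, garbage collection of old versions, and chain reconfiguration, none of which is modelled here.

```python
class CraqChain:
    """Toy CRAQ chain: node 0 is the head, the last node is the tail."""

    def __init__(self, length):
        # store[i][obj_id] maps version -> value held by node i.
        self.store = [dict() for _ in range(length)]
        # clean[i][obj_id] is the highest version node i knows is committed.
        self.clean = [dict() for _ in range(length)]

    def propagate(self, obj_id, version, value, reach):
        """Forward a write from the head to the first `reach` nodes."""
        for i in range(reach):
            self.store[i].setdefault(obj_id, {})[version] = value
        if reach == len(self.store):
            # The write reached the tail, so it is committed; the
            # acknowledgement flowing back marks it clean on every node.
            for i in range(len(self.store)):
                self.clean[i][obj_id] = version

    def read(self, node, obj_id):
        versions = self.store[node].get(obj_id)
        if not versions:
            raise KeyError(obj_id)
        latest = max(versions)
        if self.clean[node].get(obj_id) == latest:
            return versions[latest]  # clean: answer locally
        # Dirty: ask the tail which version is committed, then answer
        # with the local copy of that version.
        tail_versions = self.store[-1].get(obj_id)
        if not tail_versions:
            return None  # nothing committed yet
        return versions[max(tail_versions)]


chain = CraqChain(3)
chain.propagate("k", 1, "a", reach=3)  # committed on all nodes
chain.propagate("k", 2, "b", reach=2)  # in flight: head and middle are dirty
print(chain.read(0, "k"))              # a  (dirty read resolved via the tail)
chain.propagate("k", 2, "b", reach=3)  # write reaches the tail and commits
print(chain.read(0, "k"))              # b
```

The sketch shows both sides of the trade-off: any node can serve reads, but a write is only committed once it has traversed the whole chain.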


