[
https://issues.apache.org/jira/browse/HDDS-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088166#comment-18088166
]
Ivan Andika edited comment on HDDS-12578 at 6/11/26 6:19 AM:
-------------------------------------------------------------
[~szetszwo] Thanks for the evaluation.
> CRAQ: if the data is dirty, ask the last node for the latest version. (The
> idea is similar to the Read-Index algorithm.)
From my understanding, asking the last node as CRAQ does might not be needed
for Ozone, for these reasons:
- A CRAQ read is "V ← read(objID)" and does not carry a version number. An
Ozone read carries the BCSID as a "version number", so we don't need to read
the latest data.
- An Ozone block is unique (i.e. the local ID is uniquely assigned by the SCM
distributed sequence ID generator) and append-only, so we can assume that a
read request for a block I at version 3 also sees the data appended for
versions 2 and 1.
-- For CRAQ, each object can be replaced by a new object with a higher
version. This might be needed if we were implementing a distributed object
store without a metadata server (e.g. Ceph OSD).
So I think we can use the visible length or BCSID for Ozone, unless we want to
support some kind of distributed object storage without metadata servers, like
Ceph or MinIO.
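To make the comparison concrete, here is a minimal sketch of a read that
carries a version number. ReplicaSketch, BcsidReadSketch, and readAt are
hypothetical names, not Ozone classes, and the commit sequence ID only stands
in for the BCSID. It assumes an append-only block where a replica that has
reached the requested version can answer on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical model of one replica of an append-only block (not an Ozone class). */
final class ReplicaSketch {
    // Committed chunks in commit order; index i holds the data committed at ID i + 1.
    private final List<String> chunks = new ArrayList<>();

    /** Appends data and returns the new commit sequence ID (stands in for a BCSID). */
    long append(String data) {
        chunks.add(data);
        return chunks.size();
    }

    long committedId() {
        return chunks.size();
    }

    /** Serves a read only if this replica has caught up to the requested version. */
    Optional<List<String>> read(long requestedId) {
        if (requestedId > committedId()) {
            return Optional.empty(); // replica is behind; the caller should try another one
        }
        // Append-only: data up to requestedId never changes, so no tail query is needed.
        List<String> prefix = new ArrayList<>(chunks.subList(0, (int) requestedId));
        return Optional.of(prefix);
    }
}

final class BcsidReadSketch {
    /** Tries replicas in order and returns the first answer for the requested version. */
    static List<String> readAt(long requestedId, List<ReplicaSketch> replicas) {
        for (ReplicaSketch replica : replicas) {
            Optional<List<String>> result = replica.read(requestedId);
            if (result.isPresent()) {
                return result.get();
            }
        }
        throw new IllegalStateException("no replica has reached version " + requestedId);
    }

    public static void main(String[] args) {
        ReplicaSketch upToDate = new ReplicaSketch();
        ReplicaSketch lagging = new ReplicaSketch();
        upToDate.append("a");
        upToDate.append("b");
        long v3 = upToDate.append("c");
        lagging.append("a");
        // The lagging replica declines version 3; the up-to-date one serves versions 1..3.
        System.out.println(readAt(v3, Arrays.asList(lagging, upToDate)));
    }
}
```

In this sketch the version in the request replaces the "ask the last node"
step: a replica either has the requested version or declines.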
It's been a while since I looked at the CRAQ logic, so I might be missing
something.
> Ozone on CRAQ
> -------------
>
> Key: HDDS-12578
> URL: https://issues.apache.org/jira/browse/HDDS-12578
> Project: Apache Ozone
> Issue Type: Wish
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: screenshot-1.png
>
>
> This is just a long-term wish to explore Chain Replication or CRAQ on Ozone.
> Currently Ozone supports a Raft-based write pipeline and EC. On the data
> replication design spectrum
> ([https://transactional.blog/blog/2024-data-replication-design-spectrum]),
> these two pipelines cover the Leader-based (Raft-based write pipeline) and
> Quorum-based (EC) replication algorithms. CRAQ falls under
> Reconfiguration-based replication algorithms.
> We can consider supporting CRAQ pipelines on Ozone. As mentioned in the
> discussion
> [https://github.com/apache/ozone/discussions/6870#discussioncomment-9907706],
> chain replication might be needed for rolling upgrade support. Although
> CRAQ promises higher bandwidth, higher read performance, and strong
> consistency, it has drawbacks such as higher write latency (since all
> writes need to propagate to the tail) and longer downtime during node
> failure (waiting for the control plane to reconfigure the chains).
> The wish comes from the recent DeepSeek 3FS distributed file system, which
> uses CRAQ as its main write pipeline
> ([https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md]). Other
> systems, such as Meta's Delta
> ([https://engineering.fb.com/2022/05/04/data-infrastructure/delta/]), also
> use CRAQ.
> Since it is a Reconfiguration-based replication algorithm, there might be a
> need to support ZooKeeper-like semantics on top of Ratis or Raft in SCM HA,
> similar to ClickHouse Keeper ([https://clickhouse.com/clickhouse/keeper]) or
> Meta's Zelos (https://engineering.fb.com/2022/06/08/developer-tools/zelos/).
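
For readers less familiar with CRAQ, the write and read paths mentioned in the
description can be sketched roughly as follows. This is a hypothetical
single-process simulation of one object on one chain, not code from Ozone,
3FS, or the CRAQ paper; CraqChainSketch and its methods are made-up names, and
the "reached" parameter is only a device for modelling a write that is still
in flight (it is assumed to be between 1 and the chain length).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical single-object CRAQ chain, simulated in one process for illustration. */
final class CraqChainSketch {
    /** One chain node holding every version of the object it has seen. */
    private static final class Node {
        final TreeMap<Long, String> versions = new TreeMap<>();
        long committed = 0; // highest version this node knows to be committed (clean)

        boolean isDirty() {
            return !versions.isEmpty() && versions.lastKey() > committed;
        }
    }

    private final List<Node> chain = new ArrayList<>();
    private long nextVersion = 1;

    CraqChainSketch(int length) {
        for (int i = 0; i < length; i++) {
            chain.add(new Node());
        }
    }

    private Node tail() {
        return chain.get(chain.size() - 1);
    }

    /** Delivers a new version to the first `reached` nodes; it commits only at the tail. */
    long write(String value, int reached) {
        long version = nextVersion++;
        for (int i = 0; i < reached; i++) {
            chain.get(i).versions.put(version, value);
        }
        if (reached == chain.size()) {
            tail().committed = version;
        }
        return version;
    }

    /** Sends the acknowledgement back up the chain, marking the version clean everywhere. */
    void acknowledge(long version) {
        for (Node node : chain) {
            node.committed = Math.max(node.committed, version);
        }
    }

    boolean isDirty(int nodeIndex) {
        return chain.get(nodeIndex).isDirty();
    }

    /** Reads at any node; a dirty node asks the tail which version is committed. */
    String read(int nodeIndex) {
        Node node = chain.get(nodeIndex);
        long version = node.isDirty() ? tail().committed : node.committed;
        return node.versions.get(version); // null if nothing has been committed yet
    }
}
```

The sketch shows both sides of the trade-off: any node can serve reads, but a
write becomes visible only after it has travelled the whole chain to the tail.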