This is an automated email from the ASF dual-hosted git repository.

xuanwo pushed a commit to branch checksum
in repository https://gitbox.apache.org/repos/asf/opendal.git

commit e552b152d1284c455eadf41c3fa89fa03bf840e0
Author: Xuanwo <[email protected]>
AuthorDate: Mon Nov 24 22:52:21 2025 +0800

    rfc: Checksum
    
    Signed-off-by: Xuanwo <[email protected]>
---
 core/src/docs/rfcs/0000_checksum.md | 171 ++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)

diff --git a/core/src/docs/rfcs/0000_checksum.md b/core/src/docs/rfcs/0000_checksum.md
new file mode 100644
index 000000000..b7dea55da
--- /dev/null
+++ b/core/src/docs/rfcs/0000_checksum.md
@@ -0,0 +1,171 @@
+- Proposal Name: checksum
+- Start Date: 2025-11-24
+- RFC PR: [apache/opendal#0000](https://github.com/apache/opendal/pull/0000)
+- Tracking Issue: [apache/opendal#0000](https://github.com/apache/opendal/issues/0000)
+
+# Summary
+
+Add a single full-file checksum abstraction (`Checksum { algo, value }`), capability booleans for supported algorithms, a write option for user-provided checksums, a metadata field that returns the final checksum, and a `ChecksumLayer` that can auto-compute checksums and enforce end-to-end verification using a preferred algorithm order.
+
+# Motivation
+
+- Give users a storage-agnostic way to attach and receive full-file checksums.
+- Detect corruption or mismatched uploads early by comparing expected vs actual values.
+- Provide an opt-in layer to fill gaps where backends cannot verify or return checksums.
+- Keep changes minimal and consistent with existing `Capability` boolean style.
+
+# Guide-level explanation
+
+## New concepts
+- `ChecksumAlgo`: algorithms we support (`Crc64Nvme`, `Crc32c`, `Md5`, `Sha256`, extensible).
+- `Checksum`: holds exactly one algorithm and the full-file checksum bytes.
+- `ChecksumLayer`: optional layer that computes/checks checksums with a preferred algorithm list and an `enforce` flag.
+
+## Examples
+
+### Write with a user-computed checksum (no layer)
+```rust,no_run
+use opendal::services;
+use opendal::{Checksum, ChecksumAlgo, Operator, Result};
+
+fn crc64_nvme_of(data: &[u8]) -> Vec<u8> {
+    // user-side computation (placeholder)
+    vec![0; 8]
+}
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let builder = services::Memory::default();
+    let op = Operator::new(builder)?.finish();
+
+    let data = b"hello checksum".to_vec();
+    let expected = Checksum::new(ChecksumAlgo::Crc64Nvme, crc64_nvme_of(&data));
+
+    // Assumes the backend supports CRC64-NVMe. A mismatch returns ErrorKind::ChecksumMismatch.
+    op.write_with("foo.txt", data)
+        .checksum(expected)
+        .await?;
+    Ok(())
+}
+```
+
+### Read and inspect checksum from metadata
+```rust,no_run
+use opendal::services;
+use opendal::{Operator, Result};
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let builder = services::Memory::default();
+    let op = Operator::new(builder)?.finish();
+
+    let meta = op.stat("foo.txt").await?;
+    if let Some(cs) = meta.checksum() {
+        println!("algo={:?}, value={:x?}", cs.algo, cs.value);
+    }
+    Ok(())
+}
+```
+
+### Enable end-to-end verification via ChecksumLayer (auto-compute)
+```rust,no_run
+use opendal::layers::ChecksumLayer;
+use opendal::services;
+use opendal::{ChecksumAlgo, Operator, Result};
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let builder = services::Memory::default();
+
+    // Prefer CRC64-NVMe, fall back to Sha256. With enforce=true, the layer computes locally if the backend lacks support; any mismatch errors out.
+    let op = Operator::new(builder)?
+        .layer(ChecksumLayer::new().preferred(vec![ChecksumAlgo::Crc64Nvme, ChecksumAlgo::Sha256]).enforce(true))
+        .finish();
+
+    // The user does not provide a checksum; the layer computes and attaches one automatically.
+    op.write("bar.bin", b"data".to_vec()).await?;
+
+    // If metadata lacks the preferred checksum, the layer will stream-read and compute it.
+    let _ = op.read("bar.bin").await?;
+    Ok(())
+}
+```
+
+### Error on mismatch
+```rust,no_run
+use opendal::services;
+use opendal::{Checksum, ChecksumAlgo, Operator, Result, ErrorKind};
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let builder = services::Memory::default();
+    let op = Operator::new(builder)?.finish();
+
+    let wrong = Checksum::new(ChecksumAlgo::Sha256, vec![0; 32]);
+    let res = op
+        .write_with("bad.bin", b"payload".to_vec())
+        .checksum(wrong)
+        .await;
+
+    assert!(matches!(res, Err(err) if err.kind() == ErrorKind::ChecksumMismatch));
+    Ok(())
+}
+```
+
+# Reference-level explanation
+
+## Data types
+- `ChecksumAlgo`: enum of supported algorithms. Extending this enum is allowed.
+- `Checksum`: `{ algo: ChecksumAlgo, value: Vec<u8> }`; represents the full file only.
+- `Metadata`: add `checksum: Option<Checksum>` plus helpers (`checksum()`, `crc64_nvme()`, etc.).
+
+## Capability
+- Add boolean fields to `Capability`: `checksum_crc64_nvme`, `checksum_crc32c`, `checksum_md5`, `checksum_sha256`.
+- Semantics: `true` means the backend can accept and return that algorithm for full-file checksum.
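
Editor's sketch (not part of the commit): the capability flags above could be queried per algorithm roughly as follows. The trimmed `Capability` struct and the `supports_checksum` helper are hypothetical stand-ins for illustration; the RFC only proposes the four boolean fields.

```rust
// Hypothetical stand-ins for the proposed types; not the real OpenDAL definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChecksumAlgo {
    Crc64Nvme,
    Crc32c,
    Md5,
    Sha256,
}

/// Only the checksum flags proposed in the RFC; the real `Capability` has many more fields.
/// Plain booleans keep the struct `Copy`.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Capability {
    pub checksum_crc64_nvme: bool,
    pub checksum_crc32c: bool,
    pub checksum_md5: bool,
    pub checksum_sha256: bool,
}

impl Capability {
    /// Assumed helper (not in the RFC): map an algorithm to its capability flag.
    pub fn supports_checksum(&self, algo: ChecksumAlgo) -> bool {
        match algo {
            ChecksumAlgo::Crc64Nvme => self.checksum_crc64_nvme,
            ChecksumAlgo::Crc32c => self.checksum_crc32c,
            ChecksumAlgo::Md5 => self.checksum_md5,
            ChecksumAlgo::Sha256 => self.checksum_sha256,
        }
    }
}
```

An exhaustive `match` means that adding a new `ChecksumAlgo` variant fails to compile until a matching flag is wired in, which may help keep the enum and the booleans in sync.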
+
+## Write path
+- `WriteOptions` / `OpWrite` gains `checksum: Option<Checksum>`.
+- Flow:
+  1. If `checksum` is provided and its algo flag is `false` in capability, return `Unsupported`.
+  2. If supported, pass to backend; mismatch returns `ChecksumMismatch`.
+  3. Response metadata includes the final checksum (from backend or layer).
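
Editor's sketch (not part of the commit): the three-step flow above, modelled with standalone types. Every name here (`check_write`, the trimmed `Capability`, the `backend_value` closure standing in for whatever the backend computes over the uploaded bytes) is an assumption for illustration, not the proposed OpenDAL API.

```rust
// Hypothetical, trimmed stand-ins for the proposed types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChecksumAlgo {
    Crc32c,
    Sha256,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Checksum {
    pub algo: ChecksumAlgo,
    pub value: Vec<u8>,
}

#[derive(Debug, Clone, Copy, Default)]
pub struct Capability {
    pub checksum_crc32c: bool,
    pub checksum_sha256: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ErrorKind {
    Unsupported,
    ChecksumMismatch,
}

/// Returns the checksum to place in the response metadata, if any.
/// `backend_value` is an assumed hook yielding the backend-side checksum bytes.
pub fn check_write<F>(
    cap: Capability,
    provided: Option<Checksum>,
    backend_value: F,
) -> Result<Option<Checksum>, ErrorKind>
where
    F: Fn(ChecksumAlgo) -> Vec<u8>,
{
    // No user checksum: nothing to verify in this simplified model.
    let Some(expected) = provided else {
        return Ok(None);
    };
    // Step 1: reject algorithms the backend does not advertise.
    let supported = match expected.algo {
        ChecksumAlgo::Crc32c => cap.checksum_crc32c,
        ChecksumAlgo::Sha256 => cap.checksum_sha256,
    };
    if !supported {
        return Err(ErrorKind::Unsupported);
    }
    // Step 2: the backend verifies; a difference is a mismatch.
    let actual = backend_value(expected.algo);
    if actual != expected.value {
        return Err(ErrorKind::ChecksumMismatch);
    }
    // Step 3: the final checksum goes into the response metadata.
    Ok(Some(Checksum { algo: expected.algo, value: actual }))
}
```

In this model the capability check runs before any comparison, so an unsupported algorithm surfaces as `Unsupported` even when the supplied value is also wrong.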
+
+## Read/stat path
+- If backend provides checksum, fill `Metadata::checksum`.
+- Otherwise leave `None`; `ChecksumLayer` may compute and inject.
+
+## ChecksumLayer
+- Config: `preferred(Vec<ChecksumAlgo>)`, `enforce(bool)`.
+- Selection: pick the first preferred algo whose capability flag is true; if none is supported and `enforce=false`, skip; if `enforce=true`, compute locally anyway.
+- Write: if the backend cannot verify, stream-compute the chosen algo; compare against the provided `checksum` (if any); mismatch -> `ChecksumMismatch`; inject the result into the returned metadata.
+- Read: if metadata lacks the chosen algo, stream-compute; mismatch -> `ChecksumMismatch`; if `enforce=true` and a checksum cannot be obtained, surface `Unsupported` or mismatch.
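
Editor's sketch (not part of the commit): the selection rule above as a pure function. The `Plan` enum and `select` function are hypothetical. The RFC does not say which algorithm the layer computes locally when no preferred algorithm is supported, so this sketch assumes the first entry of the preferred list, and it assumes an empty list means "skip".

```rust
// Hypothetical stand-in for the proposed enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChecksumAlgo {
    Crc64Nvme,
    Crc32c,
    Md5,
    Sha256,
}

/// What the layer decides to do for one operation (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plan {
    /// The backend advertises this algorithm; delegate verification to it.
    Backend(ChecksumAlgo),
    /// No preferred algorithm is supported and `enforce` is set; compute locally.
    Local(ChecksumAlgo),
    /// No preferred algorithm is supported and `enforce` is off.
    Skip,
}

/// `supported` stands in for a lookup of the backend's capability flags.
pub fn select(
    preferred: &[ChecksumAlgo],
    supported: impl Fn(ChecksumAlgo) -> bool,
    enforce: bool,
) -> Plan {
    // First preferred algorithm whose capability flag is true wins.
    if let Some(algo) = preferred.iter().copied().find(|a| supported(*a)) {
        return Plan::Backend(algo);
    }
    // Assumption: fall back to the first preferred algorithm when enforcing.
    match (enforce, preferred.first().copied()) {
        (true, Some(algo)) => Plan::Local(algo),
        _ => Plan::Skip,
    }
}
```

Keeping the decision separate from the streaming computation would let the same plan drive both the write and the read path.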
+
+## Errors
+- New `ErrorKind::ChecksumMismatch` for value differences.
+- Unsupported algorithm keeps using existing `Unsupported` error kind.
+
+## Backward compatibility
+- `content_md5` stays; when a backend returns MD5, the value can populate both `content_md5` and `checksum(algo=Md5)`.
+- No behavior change for users who ignore checksum features.
+
+# Drawbacks
+- More boolean fields in `Capability`; adding many algorithms enlarges the struct.
+- `ChecksumLayer` can add CPU cost for large objects when `enforce` is enabled.
+
+# Rationale and alternatives
+- Chose capability booleans to match existing style and keep `Capability: Copy`.
+- Rejected multi-checksum containers to keep the surface small and semantics single-valued.
+- Rejected HashSet/bitmask because booleans are already the established pattern in `Capability`.
+
+# Prior art
+- Cloud SDKs commonly expose a single MD5/CRC32C field (e.g., GCS, OSS); we generalize to multiple algorithms via booleans.
+- Middleware-style checksum verification mirrors S3 client behavior, but is made storage-agnostic here.
+
+# Unresolved questions
+- Default preferred order for `ChecksumLayer` (proposed: `Crc64Nvme`, `Sha256`, `Crc32c`, `Md5`).
+- Per-backend capability matrix (which algorithms to mark true by default).
+
+# Future possibilities
+- Add more algorithms (e.g., `Sha1`) with new booleans.
+- Optional `reverify_on_read` flag in `ChecksumLayer` to recompute even when a checksum exists.
+- Expose checksum info in presign responses when services support checksum headers.
