This is an automated email from the ASF dual-hosted git repository. xuanwo pushed a commit to branch checksum in repository https://gitbox.apache.org/repos/asf/opendal.git
commit e552b152d1284c455eadf41c3fa89fa03bf840e0 Author: Xuanwo <[email protected]> AuthorDate: Mon Nov 24 22:52:21 2025 +0800 rfc: Checksum Signed-off-by: Xuanwo <[email protected]> --- core/src/docs/rfcs/0000_checksum.md | 171 ++++++++++++++++++++++++++++++++++++ 1 file changed, 171 insertions(+) diff --git a/core/src/docs/rfcs/0000_checksum.md b/core/src/docs/rfcs/0000_checksum.md new file mode 100644 index 000000000..b7dea55da --- /dev/null +++ b/core/src/docs/rfcs/0000_checksum.md @@ -0,0 +1,171 @@ +- Proposal Name: checksum +- Start Date: 2025-11-24 +- RFC PR: [apache/opendal#0000](https://github.com/apache/opendal/pull/0000) +- Tracking Issue: [apache/opendal#0000](https://github.com/apache/opendal/issues/0000) + +# Summary + +Add a single full-file checksum abstraction (`Checksum { algo, value }`), capability booleans for supported algorithms, write options for user-provided checksums, metadata return of the final checksum, and a `ChecksumLayer` that can auto-compute and enforce end-to-end verification using a preferred algorithm order. + +# Motivation + +- Give users a storage-agnostic way to attach and receive full-file checksums. +- Detect corruption or mismatched uploads early by comparing expected vs actual values. +- Provide an opt-in layer to fill gaps where backends cannot verify or return checksums. +- Keep changes minimal and consistent with existing `Capability` boolean style. + +# Guide-level explanation + +## New concepts +- `ChecksumAlgo`: algorithms we support (`Crc64Nvme`, `Crc32c`, `Md5`, `Sha256`, extensible). +- `Checksum`: holds exactly one algorithm and the full-file checksum bytes. +- `ChecksumLayer`: optional layer that computes/checks checksums with a preferred algorithm list and an `enforce` flag. + +## Examples + +### Write with a user-computed checksum (no layer) +```rust,no_run +use opendal::services; +use opendal::{Checksum, ChecksumAlgo, Operator, Result}; + +fn crc64_nvme_of(data: &[u8]) -> Vec<u8> { + // user-side computation (placeholder) + vec![0; 8] +} + +#[tokio::main] +async fn main() -> Result<()> { + let mut builder = services::Memory::default(); + let op = Operator::new(builder)?.finish(); + + let data = b"hello checksum".to_vec(); + let expected = Checksum::new(ChecksumAlgo::Crc64Nvme, crc64_nvme_of(&data)); + + // Backend supports CRC64-NVMe. Mismatch returns ErrorKind::ChecksumMismatch. + op.write_with("foo.txt", data) + .checksum(expected) + .await?; + Ok(()) +} +``` + +### Read and inspect checksum from metadata +```rust,no_run +use opendal::services; +use opendal::{Operator, Result}; + +#[tokio::main] +async fn main() -> Result<()> { + let mut builder = services::Memory::default(); + let op = Operator::new(builder)?.finish(); + + let meta = op.stat("foo.txt").await?; + if let Some(cs) = meta.checksum() { + println!("algo={:?}, value={:x?}", cs.algo, cs.value); + } + Ok(()) +} +``` + +### Enable end-to-end verification via ChecksumLayer (auto-compute) +```rust,no_run +use opendal::layers::ChecksumLayer; +use opendal::services; +use opendal::{ChecksumAlgo, Operator, Result}; + +#[tokio::main] +async fn main() -> Result<()> { + let mut builder = services::Memory::default(); + + // Prefer CRC64-NVMe, fall back to Sha256. Enforce=true: if backend lacks support, compute locally; any mismatch errors out. + let op = Operator::new(builder)? + .layer(ChecksumLayer::new().preferred(vec![ChecksumAlgo::Crc64Nvme, ChecksumAlgo::Sha256]).enforce(true)) + .finish(); + + // User does not provide checksum; layer will compute and attach automatically. + op.write("bar.bin", b"data".to_vec()).await?; + + // If metadata lacks the preferred checksum, the layer will stream-read and compute. + let _ = op.read("bar.bin").await?; + Ok(()) +} +``` + +### Error on mismatch +```rust,no_run +use opendal::services; +use opendal::{Checksum, ChecksumAlgo, Operator, Result, ErrorKind}; + +#[tokio::main] +async fn main() -> Result<()> { + let mut builder = services::Memory::default(); + let op = Operator::new(builder)?.finish(); + + let wrong = Checksum::new(ChecksumAlgo::Sha256, vec![0; 32]); + let res = op + .write_with("bad.bin", b"payload".to_vec()) + .checksum(wrong) + .await; + + assert!(matches!(res, Err(err) if err.kind() == ErrorKind::ChecksumMismatch)); + Ok(()) +} +``` + +# Reference-level explanation + +## Data types +- `ChecksumAlgo`: enum of supported algorithms. Extending this enum is allowed. +- `Checksum`: `{ algo: ChecksumAlgo, value: Vec<u8> }`; represents the full file only. +- `Metadata`: add `checksum: Option<Checksum>` plus helpers (`checksum()`, `crc64_nvme()`, etc.). + +## Capability +- Add boolean fields to `Capability`: `checksum_crc64_nvme`, `checksum_crc32c`, `checksum_md5`, `checksum_sha256`. +- Semantics: `true` means the backend can accept and return that algorithm for full-file checksum. + +## Write path +- `WriteOptions` / `OpWrite` gains `checksum: Option<Checksum>`. +- Flow: + 1. If `checksum` is provided and its algo flag is `false` in capability, return `Unsupported`. + 2. If supported, pass to backend; mismatch returns `ChecksumMismatch`. + 3. Response metadata includes the final checksum (from backend or layer). + +## Read/stat path +- If backend provides checksum, fill `Metadata::checksum`. +- Otherwise leave `None`; `ChecksumLayer` may compute and inject. + +## ChecksumLayer +- Config: `preferred(Vec<ChecksumAlgo>)`, `enforce(bool)`. +- Selection: pick the first preferred algo whose capability flag is true; if none and `enforce=false`, skip; if `enforce=true`, compute locally anyway. +- Write: if backend cannot verify, stream-compute chosen algo; compare against provided `checksum` (if any); mismatch -> `ChecksumMismatch`; inject result into returned metadata. +- Read: if metadata lacks chosen algo, stream-compute; mismatch -> `ChecksumMismatch`; if `enforce=true` and cannot obtain, surface `Unsupported` or mismatch. + +## Errors +- New `ErrorKind::ChecksumMismatch` for value differences. +- Unsupported algorithm keeps using existing `Unsupported` error kind. + +## Backward compatibility +- `content_md5` stays; when backends return MD5, it can populate both `content_md5` and `checksum(algo=Md5)`. +- No behavior change for users who ignore checksum features. + +# Drawbacks +- More boolean fields in `Capability`; adding many algorithms enlarges the struct. +- `ChecksumLayer` can add CPU cost for large objects when enforce is enabled. + +# Rationale and alternatives +- Chose capability booleans to match existing style and keep `Capability: Copy`. +- Rejected multi-checksum containers to keep the surface small and semantics single-valued. +- Rejected HashSet/bitmask because booleans are already the established pattern in `Capability`. + +# Prior art +- Cloud SDKs commonly expose a single MD5/CRC32C field (e.g., GCS, OSS); we generalize to multiple algorithms via booleans. +- Middleware-style checksum verification mirrors S3 client behaviors but made storage-agnostic here. + +# Unresolved questions +- Default preferred order for `ChecksumLayer` (proposed: `Crc64Nvme`, `Sha256`, `Crc32c`, `Md5`). +- Per-backend capability matrix (which algorithms to mark true by default). + +# Future possibilities +- Add more algorithms (e.g., `Sha1`) with new booleans. +- Optional `reverify_on_read` flag in `ChecksumLayer` to recompute even when a checksum exists. +- Expose checksum info in presign responses when services support checksum headers.
