Re: [PR] HDDS-13919. S3 Conditional Writes (PutObject) [ozone]

via GitHub Mon, 24 Nov 2025 21:27:06 -0800


ivandika3 commented on code in PR #9334:
URL: https://github.com/apache/ozone/pull/9334#discussion_r2558578333



##########
hadoop-hdds/docs/content/design/s3-conditional-requests.md:
##########
@@ -0,0 +1,149 @@
+---
+title: "S3 Conditional Requests"
+summary: Design to support S3 conditional requests for atomic operations.
+date: 2025-11-20
+jira: HDDS-13117
+status: draft
+author: Chu Cheng Li
+---
+
+# S3 Conditional Requests Design
+
+## Background
+
+AWS S3 supports conditional requests using HTTP conditional headers, enabling 
atomic operations, cache optimization, and preventing race conditions. This 
includes:
+
+- **Conditional Writes** (PutObject): `If-Match` and `If-None-Match` headers 
for atomic operations
+- **Conditional Reads** (GetObject, HeadObject): `If-Match`, `If-None-Match`, 
`If-Modified-Since`, `If-Unmodified-Since` for cache validation
+- **Conditional Copy** (CopyObject): Conditions on both source and destination 
objects
+
+### Current State
+
+- HDDS-10656 implemented atomic rewrite using `expectedDataGeneration`
+- OM HA uses single Raft group with single applier thread (Ratis 
StateMachineUpdater)
+- S3 gateway doesn't expose conditional headers to OM layer
+
+## Use Cases
+
+### Conditional Writes
+- **Atomic key rewrites**: Prevent race conditions when updating existing 
objects
+- **Create-only semantics**: Prevent accidental overwrites (`If-None-Match: *`)
+- **Optimistic locking**: Enable concurrent access with conflict detection
+- **Leader election**: Implement distributed coordination using S3 as backing 
store
+
+### Conditional Reads
+- **Bandwidth optimization**: Avoid downloading unchanged objects (304 Not 
Modified)
+- **HTTP caching**: Support standard browser/CDN caching semantics
+- **Conditional processing**: Only process objects that meet specific criteria
+
+### Conditional Copy
+- **Atomic copy operations**: Copy only if source/destination meets specific 
conditions
+- **Prevent overwrite**: Copy only if destination doesn't exist
+
+## AWS S3 Conditional Write
+
+### Specification
+
+#### If-None-Match Header
+
+```
+If-None-Match: "*"
+```
+
+- Succeeds only if object does NOT exist
+- Returns `412 Precondition Failed` if object exists
+- Primary use case: Create-only semantics
+
+#### If-Match Header
+
+```
+If-Match: "<etag>"
+```
+
+- Succeeds only if object EXISTS and ETag matches
+- Returns `412 Precondition Failed` if object doesn't exist or ETag mismatches
+- Primary use case: Atomic updates (compare-and-swap)
+
+#### Restrictions
+
+- Cannot use both headers together in same request
+- No additional charges for failed conditional requests
+
+### Implementation
+
+#### Architecture Overview
+
+#### If-None-Match Implementation
+
+##### S3 Gateway Layer
+
+1. Parse `If-None-Match: *`.
+2. Set `existingKeyGeneration = -1`.
+3. Call `RpcClient.rewriteKey()`.
+
+##### OM Create Phase
+
+1. Validate `expectedDataGeneration == -1`.
+2. If key exists → throw `KEY_ALREADY_EXISTS`.
+3. Store `-1` in open key metadata.
+
+##### OM Commit Phase
+
+1. Check `expectedDataGeneration == -1` from open key.
+2. If key now exists (race condition) → throw `KEY_ALREADY_EXISTS`.
+3. Commit key.
+
+##### Race Condition Handling
+
+Using `-1` ensures atomicity. If a concurrent write (Client B) commits between 
Client A's Create and Commit, Client A's commit fails the `-1` validation check 
(key now exists), preserving strict create-if-not-exists semantics.
+
+#### If-Match Implementation
+
+Leverages existing `expectedDataGeneration` from HDDS-10656:
+
+##### S3 Gateway Layer
+
+1. Parse `If-Match: "<etag>"` header
+2. Look up existing key via `getS3KeyDetails()`
+3. Validate ETag matches, else throw `PRECOND_FAILED` (412)
+4. Extract `expectedGeneration` from existing key
+5. Pass `expectedGeneration` to RpcClient

Review Comment:
   Just for my understanding: is the reason for calling `getS3KeyDetails` to 
avoid sending a write request when the precondition fails, so that no Raft log 
entry is written and the applier thread is not blocked?
   
   I think the tradeoff is whether we want to optimize latency for the happy 
path (precondition passes) or for the failure path (precondition fails). IMO, 
in normal workloads (and under optimistic concurrency control), we assume the 
happy path happens more often, so we can validate the ETag in the key metadata 
during the key write instead. This adds another optional field to KeyArgs 
(e.g. `expectedETag`), but I think that's fine.
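   
   To make the suggestion concrete, here is a minimal sketch of what the 
write-path check could look like. Everything in it is hypothetical: the class 
`ETagPrecondition`, its `Result` values, and the assumption that the ETag is 
stored in key metadata under the name `ETag` are for illustration only and are 
not existing Ozone APIs. It also assumes a key without ETag metadata simply 
fails the precondition.

```java
import java.util.Map;

/** Hypothetical sketch of an ETag check done during the key write; not an Ozone class. */
final class ETagPrecondition {

  /** Illustrative outcomes; a real implementation would map these to OM result codes. */
  enum Result { OK, KEY_NOT_FOUND, ETAG_MISMATCH }

  // Assumption: the ETag lives in key metadata under this name.
  static final String ETAG_KEY = "ETag";

  private ETagPrecondition() { }

  /**
   * Compares the caller's expected ETag with the committed key's metadata.
   *
   * @param expectedETag value of the If-Match header with quotes stripped
   * @param existingMetadata metadata of the committed key, or null if no key exists
   */
  static Result check(String expectedETag, Map<String, String> existingMetadata) {
    if (existingMetadata == null) {
      // If-Match requires the object to exist.
      return Result.KEY_NOT_FOUND;
    }
    String actual = existingMetadata.get(ETAG_KEY);
    // Assumption: a key with no ETag metadata never satisfies If-Match.
    if (actual == null || !actual.equals(expectedETag)) {
      return Result.ETAG_MISMATCH;
    }
    return Result.OK;
  }
}
```

   With a check like this in the write path, the happy path needs no separate 
lookup from S3G, and both failure results would surface to the client as 412.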
   
   Please also note that not all Ozone keys have an ETag (e.g. keys uploaded 
through the OFS protocol), so the design should specify whether we 1) skip 
keys without ETag metadata or 2) calculate the ETag on the spot. I prefer (1) 
since it is the most lightweight implementation. Approach (2) might justify 
your approach of loading the key and calculating the ETag in S3G instead of in 
the OM applier thread, but it adds some overhead, and for MPU keys the ETag 
calculation is more complex (it's not just an MD5).
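   
   For reference, a rough sketch of why the MPU case is more involved. It 
follows the commonly observed S3 convention, where a multipart ETag is the MD5 
of the concatenated per-part MD5 digests followed by `-<part count>`. The 
class `MultipartETagSketch` is hypothetical, and it assumes the parts are 
available in memory, which would not hold for real keys.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical sketch contrasting single-part and multipart ETag calculation. */
final class MultipartETagSketch {

  private MultipartETagSketch() { }

  /** Returns the raw 16-byte MD5 digest of the input. */
  static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  /** Lower-case hex encoding that keeps leading zeros. */
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  /** Single-part upload: the ETag is the hex MD5 of the content. */
  static String singlePartETag(byte[] content) {
    return hex(md5(content));
  }

  /** Multipart upload: MD5 over the concatenated part digests, plus "-" and the part count. */
  static String multipartETag(List<byte[]> parts) {
    ByteArrayOutputStream digests = new ByteArrayOutputStream();
    for (byte[] part : parts) {
      byte[] d = md5(part);
      digests.write(d, 0, d.length);
    }
    return hex(md5(digests.toByteArray())) + "-" + parts.size();
  }
}
```

   Recomputing this on the spot would require the original part boundaries, 
which is another reason to prefer option (1).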



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

