ivandika3 commented on code in PR #9334:
URL: https://github.com/apache/ozone/pull/9334#discussion_r2629162910


##########
hadoop-hdds/docs/content/design/s3-conditional-requests.md:
##########
@@ -0,0 +1,194 @@
+---
+title: "S3 Conditional Requests"
+summary: Design to support S3 conditional requests for atomic operations.
+date: 2025-11-20
+jira: HDDS-13117
+status: draft
+author: Chu Cheng Li
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# S3 Conditional Requests Design
+
+## Background
+
+AWS S3 supports conditional requests using HTTP conditional headers, enabling 
atomic operations, cache optimization, and preventing race conditions. This 
includes:
+
+- **Conditional Writes** (PutObject): `If-Match` and `If-None-Match` headers 
for atomic operations
+- **Conditional Reads** (GetObject, HeadObject): `If-Match`, `If-None-Match`, 
`If-Modified-Since`, `If-Unmodified-Since` for cache validation
+- **Conditional Copy** (CopyObject): Conditions on both source and destination 
objects
+
+### Current State
+
+- HDDS-10656 implemented atomic rewrite using `expectedDataGeneration`
+- OM HA uses single Raft group with single applier thread (Ratis 
StateMachineUpdater)
+- S3 gateway doesn't expose conditional headers to OM layer
+
+## Use Cases
+
+### Conditional Writes
+
+- **Atomic key rewrites**: Prevent race conditions when updating existing 
objects
+- **Create-only semantics**: Prevent accidental overwrites (`If-None-Match: *`)
+- **Optimistic locking**: Enable concurrent access with conflict detection
+- **Leader election**: Implement distributed coordination using S3 as backing 
store
+
+### Conditional Reads
+
+- **Bandwidth optimization**: Avoid downloading unchanged objects (304 Not 
Modified)
+- **HTTP caching**: Support standard browser/CDN caching semantics
+- **Conditional processing**: Only process objects that meet specific criteria
+
+### Conditional Copy
+
+- **Atomic copy operations**: Copy only if source/destination meets specific 
conditions
+- **Prevent overwrite**: Copy only if destination doesn't exist
+
+## Specification
+
+### AWS S3 Conditional Write Specification
+
+#### If-None-Match Header
+
+```
+If-None-Match: *
+```
+
+- Succeeds only if object does NOT exist
+- Returns `412 Precondition Failed` if object exists
+- Primary use case: Create-only semantics
+
+#### If-Match Header
+
+```
+If-Match: "<etag>"
+```
+
+- Succeeds only if object EXISTS and ETag matches
+- Returns `412 Precondition Failed` if object doesn't exist or ETag mismatches
+- Primary use case: Atomic updates (compare-and-swap)
+
+#### Restrictions
+
+- Cannot use both headers together in same request
+- No additional charges for failed conditional requests
+
+### AWS S3 Conditional Read Specification
+
+TODO
+
+### AWS S3 Conditional Copy Specification
+
+TODO
+
+## Implementation
+
+### AWS S3 Conditional Write Implementation
+
+The implementation aims to minimize redundant RPC round trips while ensuring 
strict atomicity for conditional operations.
+
+- **If-None-Match** utilizes the atomic "Create-If-Not-Exists" capability 
([HDDS-13963](https://issues.apache.org/jira/browse/HDDS-13963)).
+- **If-Match** optimizes the happy path by pushing ETag validation directly 
into the Ozone Manager's write path, avoiding preliminary read operations.
+
+#### If-None-Match Implementation
+
+This implementation ensures strict create-only semantics by utilizing a 
specific generation ID marker.
+
+In `OzoneConsts.java`, add `-1` as a named constant for readability:
+```java
+/**
+ * Special value for expectedDataGeneration to indicate "Create-If-Not-Exists" 
semantics.
+ * When used with If-None-Match conditional requests, this ensures atomicity:
+ * if a concurrent write commits between Create and Commit phases, the commit
+ * fails the validation check, preserving strict create-if-not-exists 
semantics.
+ */
+public static final long EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1L;
+```
+
+##### S3 Gateway Layer
+
+1. Parse `If-None-Match: *`.
+2. Set `existingKeyGeneration = 
OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS`.
+3. Call `RpcClient.rewriteKey()`.
+
+##### OM Create Phase
+
+1. OM receives request with `expectedDataGeneration == 
OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS`.
+2. **Pre-check**: If key is already in the OpenKeyTable or KeyTable, throw 
`KEY_ALREADY_EXISTS`.
+3. If not exists, proceed to create the open key entry.
+
+##### OM Commit Phase (Atomicity)
+
+1. During the commit phase (or strict atomic create), the OM validates that 
the key still does not exist.
+2. If a concurrent client created the key between the Create and Commit 
phases, the transaction fails with `KEY_GENERATION_MISMATCH`.
+
+##### Race Condition Handling
+
+Using `OzoneConsts.EXPECTED_DATA_GENERATION_CREATE_IF_NOT_EXISTS = -1` ensures 
atomicity. If a concurrent write (Client B) commits between Client A's Create 
and Commit,
+Client A's commit fails the `CREATE IF NOT EXISTS` validation check, 
preserving strict create-if-not-exists semantics.
+
+> **Note**: This ability will be added along with 
[HDDS-13963](https://issues.apache.org/jira/browse/HDDS-13963) (Atomic 
Create-If-Not-Exists).
+
+#### If-Match Implementation
+
+To optimize performance and reduce latency, we avoid a pre-flight check 
(GetS3KeyDetails) and instead validate the ETag during the OM Write operation.
+This requires adding an optional `expectedETag` field to `KeyArgs`. This 
approach optimizes the "happy path" (successful match) by removing an extra 
network round trip.
+Failing requests still incur the cost of a write RPC and a Raft log entry, 
but this is acceptable under optimistic concurrency control assumptions.
+
+##### S3 Gateway Layer
+
+1. Parse `If-Match: "<etag>"` header.
+2. Populate `KeyArgs` with the parsed `expectedETag`.
+3. Send the write request (CreateKey/OpenKey) to OM.
+
+##### OM Layer (Validation Logic)
+
+Validation is performed within the `validateAndUpdateCache` method to ensure 
atomicity within the Ratis state machine application.
+
+1. **Locking**: The OM acquires the write lock for the bucket/key.
+2. **Key Lookup**: Retrieve the existing key from `KeyTable`.
+3. **Validation**:
+
+    - **Key Not Found**: If the key does not exist, throw `KEY_NOT_FOUND` 
(maps to S3 412).
+    - **No ETag Metadata**: If the existing key (e.g., uploaded via OFS) does 
not have an ETag property, skip ETag validation and allow the operation to 
proceed. This ensures compatibility with mixed access patterns (OFS and S3A) 
where S3 Conditional Writes are primarily intended for pure S3 use cases. We do 
**not** calculate ETag on the spot to avoid performance overhead on the applier 
thread.

Review Comment:
   As it stands now, the ETag is not always written when uploading a key (e.g. 
by OFS users or when using OzoneClient directly). We can technically support 
ETag on all writes, but the issues are:
   1. Computing an ETag involves calculating a hash (e.g. MD5), which adds 
overhead. If the ETag is never used, as with OFS users, that overhead is 
unnecessary.
   2. Old clients will still upload without setting an ETag. So unless we 
always calculate the ETag in OM (which is a bad idea), we cannot ensure that 
all keys have an ETag.
   3. Old keys do not always have an ETag (keys created before the ETag feature 
was deployed). So unless we spin up a new cluster or run a finalization that 
calculates the ETag of every single key (which is expensive), it's not feasible.
   
   In Ozone, the S3 compatibility for LEGACY or FSO buckets is "best-effort", 
meaning we try to support S3 as much as possible, but there will be limitations 
(compared to pure S3 object storage). Since OBS buckets can only be used by S3 
users, the conditional write guarantee there is stronger. If LEGACY or FSO 
buckets are only used by S3 users, the conditional write guarantee is stronger 
as well.
   
   In the end, it's a tradeoff between predictability and safety. We want users 
of FSO / LEGACY buckets to be able to coexist with S3 users without hitting 
unexpected exceptions. On the other hand, we also want to ensure the safety 
contract is respected. That said, if a new S3G talks to an old OM without any 
version checks, the OM would also ignore the conditional write condition 
without throwing exceptions.
   
   We can simply document this behavior. I'm fine if the community decides to 
prioritize safety, but personally I prefer predictability.
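
To make the tradeoff concrete, here is a minimal sketch of the If-Match decision described in the design, with the two policies side by side. Everything in it is hypothetical and for illustration only: the class name, the `Outcome` values (in particular `ETAG_MISMATCH`), and the `strict` flag are assumptions, not existing Ozone code. With `strict == false` a key without a stored ETag is allowed through (predictability); with `strict == true` it is rejected (safety).

```java
/** Illustrative only: not Ozone code. Names and result codes are assumptions. */
final class IfMatchValidationSketch {
  /** Hypothetical outcomes; KEY_NOT_FOUND and ETAG_MISMATCH would both map to S3 412. */
  enum Outcome { PROCEED, KEY_NOT_FOUND, ETAG_MISMATCH }

  private IfMatchValidationSketch() { }

  /**
   * @param keyExists    whether the key is present in the key table
   * @param storedETag   ETag stored with the key, or null if it has none
   * @param expectedETag ETag taken from the If-Match header
   * @param strict       hypothetical switch: reject keys that have no ETag
   */
  static Outcome validate(boolean keyExists, String storedETag,
      String expectedETag, boolean strict) {
    if (!keyExists) {
      return Outcome.KEY_NOT_FOUND;
    }
    if (storedETag == null) {
      // Key written without an ETag (e.g. via OFS or an old client).
      // Lenient policy skips validation; strict policy fails the request.
      return strict ? Outcome.ETAG_MISMATCH : Outcome.PROCEED;
    }
    return storedETag.equals(expectedETag)
        ? Outcome.PROCEED : Outcome.ETAG_MISMATCH;
  }

  public static void main(String[] args) {
    // Same request against a key without an ETag, under both policies.
    System.out.println(validate(true, null, "abc", false));
    System.out.println(validate(true, null, "abc", true));
  }
}
```

Either policy fits in the same place in the write path; the only difference is what happens in the "no stored ETag" branch, which is the behavior that would need to be documented.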
   
   Answering @sodonnel's questions:
   
   > For S3 originated writes, are we always storing etags? 
   
   Yes, but keys uploaded before the ETag feature was deployed will not have 
ETags.
   
   > Where do the etags come from?
   
   From S3G
   
   > Could non-s3 originated writes also set an etag easily? 
   
   Possible, but this requires changing KeyOutputStream to calculate ETag all 
the time. 
   
   > It feels like etags could be a summation of the CRC checksums of a block.
   
   An ETag can technically be anything that uniquely identifies an object, but 
currently for a normal (non-MPU) key it's the MD5 hash (matching current AWS S3 
behavior), while for an MPU key it's not a plain MD5 (IIRC it's a hash over the 
MD5s of all parts).
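
As a rough illustration of the two ETag shapes mentioned above, the sketch below computes a plain MD5 ETag for a single-part object and a multipart-style ETag as the MD5 of the concatenated binary part digests followed by `-<partCount>`. The multipart layout follows the commonly observed AWS S3 convention and is an assumption here; the class and method names are hypothetical, and Ozone's actual MPU ETag computation may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: not Ozone code. */
final class ETagSketch {
  private ETagSketch() { }

  private static MessageDigest newMd5() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  private static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  /** Non-MPU style: hex MD5 of the whole content. */
  public static String simpleETag(byte[] content) {
    return hex(newMd5().digest(content));
  }

  /** MPU style (assumed convention): MD5 over the part MD5s, plus "-N". */
  public static String multipartETag(List<byte[]> parts) {
    MessageDigest outer = newMd5();
    for (byte[] part : parts) {
      outer.update(newMd5().digest(part));
    }
    return hex(outer.digest()) + "-" + parts.size();
  }

  public static void main(String[] args) {
    byte[] a = "hello".getBytes(StandardCharsets.UTF_8);
    byte[] b = "world".getBytes(StandardCharsets.UTF_8);
    System.out.println(simpleETag(a));
    System.out.println(multipartETag(Arrays.asList(a, b)));
  }
}
```

Both variants require hashing every byte of the object, which is the overhead that makes computing an ETag on every non-S3 write unattractive.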




