yihua commented on code in PR #14286:
URL: https://github.com/apache/hudi/pull/14286#discussion_r2561611440


##########
rfc/rfc-91/rfc-91.md:
##########
@@ -130,28 +132,86 @@ We can use the same logic for preconditions with 
overwrite operations using the
 
 ## Test Plan
 
-We can write normal junit tests using testcontainers with GCS and S3 to 
simulate edge cases and general contention. Further adhoc testing will include 
the following scenarios:
+We can write normal JUnit tests using Testcontainers with GCS and S3 to simulate edge cases and general contention.
 
 ### Unit tests
 
 We will add some high-contention, high-usage unit tests that create hundreds of threads attempting to acquire locks simultaneously against the Testcontainers-backed storage to simulate load and contention. We can also use thread-unsafe structures like ArrayList to verify that no concurrent modifications occur.
 
-### High-Frequency Commit and Table Service Test
 
-Run a long-running streaming ingestion process that continuously performs 
inserts, updates, and deletes. Ensure that frequent commits occur while table 
services like compaction and clustering operate concurrently. This test will 
help verify that the lock provider can handle overlapping operations without 
causing excessive delays or lock contention.
+## RFC-91-update Nov 2025: Storage LP conditional write retry handling
+
+Conditional writes to S3 that are retried due to transient issues are not idempotent: a PUT can succeed on the server while a transparent retry of the same request turns the response into a 412. This is an extremely rare case; we were unable to reproduce it, but it was observed in production.
+
+To reproduce, we needed a PUT request to succeed on S3 (changing the lock owner) while the client never receives the success response (and so thinks it failed). However, with artificial network-level toxics between the client and S3, we face a dilemma: blocking or delaying responses either prevents the PUT from completing on S3 (a timeout toxic stops everything), or the SDK's aggressive retry logic eventually succeeds after waiting through the delays (a latency toxic). We cannot surgically "lose" just the response after the PUT completes, because TCP requires bidirectional communication for the operation to succeed on S3.
+
+![dfs-retry](./dfs-retry.png)
+
+The lock provider must therefore assume that any 412 it receives may correspond to a write that actually succeeded, with the 412 coming from a failed transient retry.
+
+We need to handle this separately in three cases:
+
+### during tryLock
+Any verification we perform upon receiving a 412 after attempting to acquire the lock immediately becomes a safety violation.
+
+#### Why Checking After Is Unsafe
+
+The key insight is the time window between the write and any verification 
check:
+```
+  Time    Client A                S3 State                Client B
+  ----    --------                --------                --------
+  T0      PUT if-none-match:*     
+  T1                              File created (A's data) 
+  T2                              Returns 412 to A        
+  T3      Receives 412            
+  T4      Thinks: "I failed"      
+          
+          !!! IF CLIENT A CHECKS HERE:
+  T5      GET to verify           
+          
+          BUT between T1 and T5, what could have happened?
+          
+          Scenario 1: Lock expired, Client B acquired it
+                                                          Acquired lock (B's 
data)
+  T5      Reads lock              Lock = Client B
+          Sees Client B's data
+          
+          Scenario 2: Lock still has A's data
+  T5      Reads lock              Lock = Client A
+          Sees own data
+```
+
+Therefore, we have to live with this scenario, which can produce dangling locks, until AWS addresses it. We can, however, add logic to conditionally reject the transaction:
+
+```
+412 received, waiting for current owner to release the lock
+- oh wait, we are the owner! How did this happen?
+- Conditionally release the ghost lock and return false from tryLock()
+  - If we are unable to release the lock, that's fine; forcing the release
+    through would be the exact safety violation we are trying to prevent.
+  - Emit a metric that indicates this inconsistency.
+- This avoids the 5-minute lock orphaning.
+```
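
A sketch of this ghost-lock handling, using hypothetical names and an in-memory stand-in for the conditional storage layer (not Hudi's actual lock provider code):

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of tryLock()'s 412 handling: if the "failed" PUT actually
// succeeded server-side, we may find ourselves listed as the owner. Then we
// conditionally delete the ghost lock and still return false, instead of
// orphaning the lock for its full TTL.
public class TryLockGhostHandler {
    record LockFile(String owner, String etag) {}

    // In-memory stand-in for the conditional-write storage layer.
    static final AtomicReference<LockFile> store = new AtomicReference<>();

    static Optional<LockFile> readLock() {
        return Optional.ofNullable(store.get());
    }

    // Conditional delete: only removes the lock if the ETag still matches.
    static boolean conditionalDelete(String etag) {
        LockFile cur = store.get();
        return cur != null && cur.etag().equals(etag) && store.compareAndSet(cur, null);
    }

    /** Called after a conditional PUT returned 412. Always reports failure. */
    static boolean handle412(String ourOwnerId) {
        Optional<LockFile> current = readLock();
        if (current.isPresent() && current.get().owner().equals(ourOwnerId)) {
            // Ghost lock: our retried PUT actually succeeded server-side.
            if (!conditionalDelete(current.get().etag())) {
                // Could not clean up; emit a metric (placeholder) and move on.
                System.out.println("metric: ghost_lock_release_failed");
            }
        }
        return false; // tryLock() must still report failure.
    }
}
```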
+
+### during renewLock
+Within the renewal path of the storage-based lock provider, we assume that any 412 error means someone else has taken the lock. As-is, this scenario is irrecoverable, because we do not know the latest lock's ETag; therefore, we have to handle it in renewLock().
+
+The best way forward here is to fetch the lock again after the renewal fails with a 412.
+
+#### If the owner in S3 does not match our owner
+The lock provider has failed: two writers acquired the lock, and there is nothing we can do other than set off alarm bells and exit the transaction.
 
-### Concurrent SQL and Spark Operations Test

Review Comment:
   Should we keep existing testing plans?



##########
rfc/rfc-91/rfc-91.md:
##########
@@ -130,28 +132,86 @@ We can use the same logic for preconditions with 
overwrite operations using the
 
 ## Test Plan
 
-We can write normal junit tests using testcontainers with GCS and S3 to 
simulate edge cases and general contention. Further adhoc testing will include 
the following scenarios:
+We can write normal JUnit tests using Testcontainers with GCS and S3 to simulate edge cases and general contention.
 
 ### Unit tests
 
 We will add some high-contention, high-usage unit tests that create hundreds of threads attempting to acquire locks simultaneously against the Testcontainers-backed storage to simulate load and contention. We can also use thread-unsafe structures like ArrayList to verify that no concurrent modifications occur.
 
-### High-Frequency Commit and Table Service Test
 
-Run a long-running streaming ingestion process that continuously performs 
inserts, updates, and deletes. Ensure that frequent commits occur while table 
services like compaction and clustering operate concurrently. This test will 
help verify that the lock provider can handle overlapping operations without 
causing excessive delays or lock contention.
+## RFC-91-update Nov 2025: Storage LP conditional write retry handling
+
+Conditional writes to S3 that are retried due to transient issues are not idempotent: a PUT can succeed on the server while a transparent retry of the same request turns the response into a 412. This is an extremely rare case; we were unable to reproduce it, but it was observed in production.
+
+To reproduce, we needed a PUT request to succeed on S3 (changing the lock owner) while the client never receives the success response (and so thinks it failed). However, with artificial network-level toxics between the client and S3, we face a dilemma: blocking or delaying responses either prevents the PUT from completing on S3 (a timeout toxic stops everything), or the SDK's aggressive retry logic eventually succeeds after waiting through the delays (a latency toxic). We cannot surgically "lose" just the response after the PUT completes, because TCP requires bidirectional communication for the operation to succeed on S3.

Review Comment:
   For such a reproduction, we can still block only the downlink TCP data packets while allowing downlink TCP ACKs to go through, so that the PUT request succeeds but the HTTP response is blocked and the client SDK does not receive it.



##########
rfc/rfc-91/rfc-91.md:
##########
@@ -76,6 +76,8 @@ Once a lock is acquired, a dedicated heartbeat task 
periodically calls renewLock
 - Clock drift: we allow for a maximum of 500ms of clock drift between nodes. A requirement of this lock provider is that all writers competing for the same lock must be writing from the same cloud provider (AWS/Azure/GCP).
   - This will not be configurable at this time. If a storage-specific implementation needs to customize this, the config can be added at that time, but it should never go below 500ms.
 
+See rfc-91-2.md for detailed edge case handling of conditional write retries

Review Comment:
   This sentence needs to be updated by removing `rfc-91-2.md`.



##########
rfc/rfc-91/rfc-91.md:
##########
@@ -76,6 +76,8 @@ Once a lock is acquired, a dedicated heartbeat task 
periodically calls renewLock
 - Clock drift: we allow for a maximum of 500ms of clock drift between nodes. A requirement of this lock provider is that all writers competing for the same lock must be writing from the same cloud provider (AWS/Azure/GCP).
   - This will not be configurable at this time. If a storage-specific implementation needs to customize this, the config can be added at that time, but it should never go below 500ms.
 
+See rfc-91-2.md for detailed edge case handling of conditional write retries

Review Comment:
   Move `## RFC-91-update Nov 2025: Storage LP conditional write retry 
handling` to here and rename the section as `### Retry handling`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
