yihua commented on code in PR #14286: URL: https://github.com/apache/hudi/pull/14286#discussion_r2561611440
########## rfc/rfc-91/rfc-91.md: ##########
@@ -130,28 +132,86 @@ We can use the same logic for preconditions with overwrite operations using the
 ## Test Plan
 
-We can write normal junit tests using testcontainers with GCS and S3 to simulate edge cases and general contention. Further adhoc testing will include the following scenarios:
+We can write normal JUnit tests using testcontainers with GCS and S3 to simulate edge cases and general contention.
 
 ### Unit tests
 
 We will add some high contention, high usage unit tests that create hundreds of threads to try to acquire locks simultaneously on the testcontainers to simulate load and contention. We can also use thread-unsafe structures like ArrayLists to ensure concurrent modifications do not occur.
 
-### High-Frequency Commit and Table Service Test
-Run a long-running streaming ingestion process that continuously performs inserts, updates, and deletes. Ensure that frequent commits occur while table services like compaction and clustering operate concurrently. This test will help verify that the lock provider can handle overlapping operations without causing excessive delays or lock contention.
+## RFC-91-update Nov 2025: Storage LP conditional write retry handling
+
+Conditional writes with S3 which are retried due to transient issues are not idempotent, meaning that a successful PUT can go through but a retry can turn the response into a 412. This is an extremely rare case. We were not able to reproduce it, but it was observed in production.
+
+To reproduce, we needed a PUT request to succeed on S3 (changing the lock owner) while the client never receives the success response (and so thinks it failed). However, with artificial network-level toxics between the client and S3, we face a dilemma: blocking or delaying responses either prevents the PUT from completing on S3 (a timeout toxic stops everything), or the SDK's aggressive retry logic eventually succeeds after waiting through the delays (a latency toxic).
We can't surgically "lose" just the response after the PUT completes, because TCP requires bidirectional communication for the operation to succeed on S3.
+
+The lock provider must assume that any 412 it receives can indicate a successful write whose retry failed.
+
+We need to handle this separately in 3 cases:
+
+### during tryLock
+Any check we do upon a 412 after attempting to acquire the lock immediately becomes a safety violation.
+
+#### Why Checking After Is Unsafe
+
+The key insight is the time window between the write and any verification check:
+```
+ Time   Client A                S3 State                  Client B
+ ----   --------                --------                  --------
+ T0     PUT if-none-match:*
+ T1                             File created (A's data)
+ T2                             Returns 412 to A
+ T3     Receives 412
+ T4     Thinks: "I failed"
+
+ !!! IF CLIENT A CHECKS HERE:
+ T5     GET to verify
+
+ BUT between T1 and T5, what could have happened?
+
+ Scenario 1: Lock expired, Client B acquired it
+                                                          Acquired lock (B's data)
+ T5     Reads lock              Lock = Client B
+        Sees Client B's data
+
+ Scenario 2: Lock still has A's data
+ T5     Reads lock              Lock = Client A
+        Sees own data
+```
+
+Therefore, we have to live with this bad scenario, which will produce dangling locks. Hopefully AWS can fix it. Note that we can add logic improvements to conditionally reject the transaction:
+
+```
+412 received, waiting for current owner to release the lock
+- oh wait, we are the owner! How did this happen?
+- Conditionally release the ghost lock and return false from tryLock()
+  - If we are unable to release the lock, that's fine. This is the exact safety violation we are trying to prevent.
+  - emit a metric that indicates this inconsistency
+- This avoids the 5min lock orphaning.
+```
+
+### during renewLock
+Within the renewal section of the storage-based lock provider, we assume that any 412 error means someone else has taken the lock. This scenario is irrecoverable, as we do not know the latest lock's ETag.
Therefore, we have to handle this in renewLock().
+
+The best way forward here is to fetch the lock again after renewLock fails with a 412.
+
+#### If the owner in S3 does not match our owner
+The lock provider has failed, as two writers acquired the lock, and there is nothing we can do other than set off alarm bells and exit the transaction.
-### Concurrent SQL and Spark Operations Test

Review Comment:
   Should we keep existing testing plans?

########## rfc/rfc-91/rfc-91.md: ##########
@@ -130,28 +132,86 @@ We can use the same logic for preconditions with overwrite operations using the
 ## Test Plan
 
-We can write normal junit tests using testcontainers with GCS and S3 to simulate edge cases and general contention. Further adhoc testing will include the following scenarios:
+We can write normal JUnit tests using testcontainers with GCS and S3 to simulate edge cases and general contention.
 
 ### Unit tests
 
 We will add some high contention, high usage unit tests that create hundreds of threads to try to acquire locks simultaneously on the testcontainers to simulate load and contention. We can also use thread-unsafe structures like ArrayLists to ensure concurrent modifications do not occur.
 
-### High-Frequency Commit and Table Service Test
-Run a long-running streaming ingestion process that continuously performs inserts, updates, and deletes. Ensure that frequent commits occur while table services like compaction and clustering operate concurrently. This test will help verify that the lock provider can handle overlapping operations without causing excessive delays or lock contention.
+## RFC-91-update Nov 2025: Storage LP conditional write retry handling
+
+Conditional writes with S3 which are retried due to transient issues are not idempotent, meaning that a successful PUT can go through but a retry can turn the response into a 412. This is an extremely rare case. We were not able to reproduce it, but it was observed in production.
+
+To reproduce, we needed a PUT request to succeed on S3 (changing the lock owner) while the client never receives the success response (and so thinks it failed). However, with artificial network-level toxics between the client and S3, we face a dilemma: blocking or delaying responses either prevents the PUT from completing on S3 (a timeout toxic stops everything), or the SDK's aggressive retry logic eventually succeeds after waiting through the delays (a latency toxic). We can't surgically "lose" just the response after the PUT completes, because TCP requires bidirectional communication for the operation to succeed on S3.

Review Comment:
   For such a reproduction, we can still block only the downlink TCP data packets and allow downlink TCP ACKs through, so that the PUT request succeeds on S3 but the HTTP response is blocked and the client SDK does not receive it.

########## rfc/rfc-91/rfc-91.md: ##########
@@ -76,6 +76,8 @@ Once a lock is acquired, a dedicated heartbeat task periodically calls renewLock
 - Clock drift: we allow for a maximum of 500ms of clock drift between nodes. A requirement of this lock provider is that all writers competing for the same lock must be writing from the same cloud provider (AWS/Azure/GCP).
   - This will not be configurable at this time. If a storage-specific implementation needs to customize this, the config can be added at that time, but it should never go below 500ms.
+
+See rfc-91-2.md for detailed edge case handling of conditional write retries

Review Comment:
   This sentence needs to be updated by removing `rfc-91-2.md`.

########## rfc/rfc-91/rfc-91.md: ##########
@@ -76,6 +76,8 @@ Once a lock is acquired, a dedicated heartbeat task periodically calls renewLock
 - Clock drift: we allow for a maximum of 500ms of clock drift between nodes. A requirement of this lock provider is that all writers competing for the same lock must be writing from the same cloud provider (AWS/Azure/GCP).
   - This will not be configurable at this time.
If a storage-specific implementation needs to customize this, the config can be added at that time, but it should never go below 500ms.
+
+See rfc-91-2.md for detailed edge case handling of conditional write retries

Review Comment:
   Move `## RFC-91-update Nov 2025: Storage LP conditional write retry handling` to here and rename the section as `### Retry handling`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
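[Editor's note] The tryLock mitigation discussed in the quoted RFC update (on a 412, re-read the lock; if we ourselves turn out to be the recorded owner, we hit the non-idempotent-retry anomaly, so conditionally release the ghost lock and still report failure) can be sketched as a minimal, self-contained simulation. `FakeS3`, `try_lock`, and `PreconditionFailed` are illustrative stand-ins for S3 conditional PUT/DELETE semantics, not Hudi's actual lock provider code:

```python
# Minimal sketch of the "ghost lock" scenario: a conditional PUT succeeds on
# S3, the response is lost, and the SDK's retry of the same request gets 412.
# FakeS3 is an in-memory stand-in for S3 conditional writes (If-None-Match: *
# on create, If-Match: <etag> on delete); all names here are hypothetical.
import uuid

class PreconditionFailed(Exception):
    """Stands in for S3's HTTP 412 Precondition Failed response."""

class FakeS3:
    def __init__(self):
        self.objects = {}  # key -> (etag, body)

    def put_if_none_match(self, key, body):
        # Conditional create: succeeds only if the key does not exist yet.
        if key in self.objects:
            raise PreconditionFailed
        etag = uuid.uuid4().hex
        self.objects[key] = (etag, body)
        return etag

    def get(self, key):
        return self.objects.get(key)  # (etag, body) or None

    def delete_if_match(self, key, etag):
        # Conditional delete, used to release a ghost lock without clobbering
        # a lock that some other writer may have acquired in the meantime.
        current = self.objects.get(key)
        if current is None or current[0] != etag:
            raise PreconditionFailed
        del self.objects[key]

def try_lock(s3, key, owner):
    """Return True iff `owner` holds the lock on return.

    On a 412, re-read the lock; if we are somehow the recorded owner (a
    ghost lock left by a retried-but-successful PUT), conditionally release
    it, emit-a-metric territory, and still report failure.
    """
    try:
        s3.put_if_none_match(key, owner)
        return True
    except PreconditionFailed:
        current = s3.get(key)
        if current is not None and current[1] == owner:
            # Ghost lock: our earlier PUT succeeded but we only saw the 412.
            try:
                s3.delete_if_match(key, current[0])
            except PreconditionFailed:
                pass  # Someone else raced us; nothing safe left to do.
        return False

# Simulate the production anomaly: writer A's first PUT succeeds on S3 but
# the response is lost, so the SDK's retry of the acquire sees a 412.
s3 = FakeS3()
s3.put_if_none_match("table/lock", "writer-A")     # the "lost" successful PUT
acquired = try_lock(s3, "table/lock", "writer-A")  # the retry: gets 412
print(acquired)              # → False: tryLock reports failure...
print(s3.get("table/lock"))  # → None: ...and the ghost lock was released
```

This avoids the lock sitting orphaned until expiry: after the ghost release, another writer's `try_lock` on the same key succeeds immediately.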
