Re: [PR] Implement `copy_if_not_exist` for `AmazonS3` using DynamoDB (#4880) [arrow-rs]

via GitHub Tue, 17 Oct 2023 13:42:14 -0700


alamb commented on code in PR #4918:
URL: https://github.com/apache/arrow-rs/pull/4918#discussion_r1362745234



##########
object_store/src/aws/copy.rs:
##########
@@ -39,12 +39,64 @@ pub enum S3CopyIfNotExists {
     ///
     /// [`ObjectStore::copy_if_not_exists`]: crate::ObjectStore::copy_if_not_exists
     Header(String, String),
+    /// The name of a DynamoDB table to use for coordination
+    ///
+    /// Encoded as `dynamodb:<TABLE_NAME>` ignoring whitespace
+    ///
+    /// This will use the same region, credentials and endpoint as configured for S3
+    ///
+    /// ## Limitations
+    ///
+    /// Only conditional operations, e.g. `copy_if_not_exists` will be synchronized, and can
+    /// therefore race with non-conditional operations, e.g. `put`, `copy`, or conditional
+    /// operations performed by writers not configured to synchronize with DynamoDB.
+    ///
+    /// Workloads making use of this mechanism **must** ensure:
+    ///
+    /// * Conditional and non-conditional operations are not performed on the same paths
+    /// * Conditional operations are only performed via similarly configured clients
+    ///
+    /// Additionally as the locking mechanism relies on timeouts to detect stale locks,
+    /// performance will be poor for systems that frequently rewrite the same path, instead
+    /// being optimised for systems that primarily create files with paths never used before.
+    ///
+    /// ## Locking Protocol
+    ///
+    /// The DynamoDB schema is as follows:
+    ///
+    /// * A string hash key named `"key"`
+    /// * A numeric [TTL] attribute named `"ttl"`
+    /// * A numeric attribute named `"generation"`
+    ///
+    /// The lock procedure is as follows:
+    ///
+    /// * Error if file exists in S3
+    /// * Create a corresponding record in DynamoDB with the path as the `"key"`
+    ///     * On Success: Create object in S3
+    ///     * On Conflict:
+    ///         * Periodically check if file exists in S3
+    ///         * After a 60 second timeout attempt to "claim" the lock by incrementing `"generation"`
+    ///         * GOTO start
+    ///
+    /// This is inspired by the [DynamoDB Lock Client] but simplified for the more limited
+    /// requirements of synchronizing object storage.
+    ///
+    /// The major changes are:
+    ///
+    /// * Uses a monotonic generation count instead of a UUID rvn

Review Comment:
   I was thinking of a collision between concurrent writers (different processes).
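
   To make that scenario concrete, here is a minimal sketch of two writers racing to claim the same stale lock. It assumes the claim is a conditional update that succeeds only if `"generation"` still holds the value the writer observed; the quoted docs only say the lock is claimed by incrementing `"generation"`, so the condition is an assumption. `LockTable`, `try_create`, `try_claim` and `race_two_claimers` are hypothetical in-memory stand-ins for illustration, not the PR's code or the DynamoDB API.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the DynamoDB table (not the PR's code).
#[derive(Default)]
pub struct LockTable {
    // Maps the object path (the "key" attribute) to its current "generation".
    generations: HashMap<String, u64>,
}

impl LockTable {
    /// Models a conditional put: create the record only if none exists.
    /// Returns false on conflict, i.e. another writer already created it.
    pub fn try_create(&mut self, key: &str) -> bool {
        if self.generations.contains_key(key) {
            return false;
        }
        self.generations.insert(key.to_string(), 0);
        true
    }

    /// Models a conditional update (assumed, see lead-in): increment the
    /// generation only if it still equals the value the caller observed.
    pub fn try_claim(&mut self, key: &str, observed: u64) -> bool {
        match self.generations.get_mut(key) {
            Some(current) if *current == observed => {
                *current += 1;
                true
            }
            _ => false,
        }
    }

    /// Returns the current generation for `key`, if a record exists.
    pub fn generation(&self, key: &str) -> Option<u64> {
        self.generations.get(key).copied()
    }
}

/// Two writers observe the same stale generation, then both try to claim it.
/// Returns (first claim result, second claim result, final generation).
pub fn race_two_claimers() -> (bool, bool, Option<u64>) {
    let mut table = LockTable::default();
    let key = "path/to/object";
    table.try_create(key);

    // Both writers read the generation before either attempts the claim.
    let observed = table.generation(key).unwrap();
    let first = table.try_claim(key, observed);
    let second = table.try_claim(key, observed);
    (first, second, table.generation(key))
}
```

   Under that assumption only one of the two claims can succeed, because the second writer's observed generation is already out of date. Without the condition, both increments would go through and both writers would believe they held the lock.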



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
