tustvold commented on code in PR #4918:
URL: https://github.com/apache/arrow-rs/pull/4918#discussion_r1361218038


##########
object_store/src/aws/copy.rs:
##########
@@ -39,12 +39,64 @@ pub enum S3CopyIfNotExists {
     ///
     /// [`ObjectStore::copy_if_not_exists`]: crate::ObjectStore::copy_if_not_exists
     Header(String, String),
+    /// The name of a DynamoDB table to use for coordination
+    ///
+    /// Encoded as `dynamodb:<TABLE_NAME>` ignoring whitespace
+    ///
+    /// This will use the same region, credentials and endpoint as configured for S3
+    ///
+    /// ## Limitations
+    ///
+    /// Only conditional operations, e.g. `copy_if_not_exists` will be synchronized, and can
+    /// therefore race with non-conditional operations, e.g. `put`, `copy`, or conditional
+    /// operations performed by writers not configured to synchronize with DynamoDB.
+    ///
+    /// Workloads making use of this mechanism **must** ensure:
+    ///
+    /// * Conditional and non-conditional operations are not performed on the same paths
+    /// * Conditional operations are only performed via similarly configured clients
+    ///
+    /// Additionally as the locking mechanism relies on timeouts to detect stale locks,
+    /// performance will be poor for systems that frequently rewrite the same path, instead
+    /// being optimised for systems that primarily create files with paths never used before.
+    ///
+    /// ## Locking Protocol
+    ///
+    /// The DynamoDB schema is as follows:
+    ///
+    /// * A string hash key named `"key"`
+    /// * A numeric [TTL] attribute named `"ttl"`
+    /// * A numeric attribute named `"generation"`
+    ///
+    /// The lock procedure is as follows:
+    ///
+    /// * Error if file exists in S3
+    /// * Create a corresponding record in DynamoDB with the path as the `"key"`
+    ///     * On Success: Create object in S3
+    ///     * On Conflict:
+    ///         * Periodically check if file exists in S3
+    ///         * After a 60 second timeout attempt to "claim" the lock by incrementing `"generation"`
+    ///         * GOTO start
+    ///
+    /// This is inspired by the [DynamoDB Lock Client] but simplified for the more limited
+    /// requirements of synchronizing object storage.
+    ///
+    /// The major changes are:
+    ///
+    /// * Uses a monotonic generation count instead of a UUID rvn

Review Comment:
   A UUID approach would still collide; that's the whole purpose :smile:
   
   A monotonic generation count is beneficial because it can also act as a
   fencing token: a higher generation should win over a lower generation.
   Admittedly this isn't used here, but it is generally good practice. Other,
   less compelling reasons are that a UUID is more expensive to generate, adds
   a dependency, and has a more expensive encoding.
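
   To illustrate the fencing-token point, here is a minimal sketch. It is not
   code from this PR: `FencedStore` and its `write` method are hypothetical
   names, and the in-memory maps only stand in for a downstream resource that
   remembers the highest generation it has seen for each key.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a resource guarded by fencing tokens.
#[derive(Default)]
struct FencedStore {
    /// Highest generation accepted so far for each key.
    highest: HashMap<String, u64>,
    /// Last accepted value for each key.
    values: HashMap<String, String>,
}

impl FencedStore {
    /// Accepts the write only if `generation` is not lower than the highest
    /// generation already seen for `key`. On rejection, returns that highest
    /// generation so the caller can tell its token is stale.
    fn write(&mut self, key: &str, generation: u64, value: &str) -> Result<(), u64> {
        let seen = self.highest.entry(key.to_string()).or_insert(0);
        if generation < *seen {
            return Err(*seen);
        }
        *seen = generation;
        self.values.insert(key.to_string(), value.to_string());
        Ok(())
    }
}

fn main() {
    let mut store = FencedStore::default();

    // Writer A acquires the lock at generation 1, then stalls.
    assert!(store.write("a/b", 1, "from A").is_ok());

    // Writer B times out waiting, claims the lock by bumping the generation.
    assert!(store.write("a/b", 2, "from B").is_ok());

    // Writer A wakes up and retries with its stale token; it is rejected.
    assert_eq!(store.write("a/b", 1, "late write from A"), Err(2));

    assert_eq!(store.values.get("a/b").map(String::as_str), Some("from B"));
}
```

   A UUID cannot be used this way because two UUIDs have no meaningful
   order, so the resource could not tell which holder is the more recent one.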


