alamb commented on code in PR #4918:
URL: https://github.com/apache/arrow-rs/pull/4918#discussion_r1362745234
##########
object_store/src/aws/copy.rs:
##########
@@ -39,12 +39,64 @@ pub enum S3CopyIfNotExists {
///
/// [`ObjectStore::copy_if_not_exists`]: crate::ObjectStore::copy_if_not_exists
Header(String, String),
+ /// The name of a DynamoDB table to use for coordination
+ ///
+ /// Encoded as `dynamodb:<TABLE_NAME>` ignoring whitespace
+ ///
+ /// This will use the same region, credentials and endpoint as configured for S3
+ ///
+ /// ## Limitations
+ ///
+ /// Only conditional operations, e.g. `copy_if_not_exists` will be synchronized, and can
+ /// therefore race with non-conditional operations, e.g. `put`, `copy`, or conditional
+ /// operations performed by writers not configured to synchronize with DynamoDB.
+ ///
+ /// Workloads making use of this mechanism **must** ensure:
+ ///
+ /// * Conditional and non-conditional operations are not performed on the same paths
+ /// * Conditional operations are only performed via similarly configured clients
+ ///
+ /// Additionally as the locking mechanism relies on timeouts to detect stale locks,
+ /// performance will be poor for systems that frequently rewrite the same path, instead
+ /// being optimised for systems that primarily create files with paths never used before.
+ ///
+ /// ## Locking Protocol
+ ///
+ /// The DynamoDB schema is as follows:
+ ///
+ /// * A string hash key named `"key"`
+ /// * A numeric [TTL] attribute named `"ttl"`
+ /// * A numeric attribute named `"generation"`
+ ///
+ /// The lock procedure is as follows:
+ ///
+ /// * Error if file exists in S3
+ /// * Create a corresponding record in DynamoDB with the path as the `"key"`
+ ///   * On Success: Create object in S3
+ ///   * On Conflict:
+ ///     * Periodically check if file exists in S3
+ ///     * After a 60 second timeout attempt to "claim" the lock by incrementing `"generation"`
+ ///     * GOTO start
+ ///
+ /// This is inspired by the [DynamoDB Lock Client] but simplified for the more limited
+ /// requirements of synchronizing object storage.
+ ///
+ /// The major changes are:
+ ///
+ /// * Uses a monotonic generation count instead of a UUID rvn
Review Comment:
I was thinking of a collision between concurrent writers (different processes).
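
To make the concern concrete, here is a minimal sketch of how the generation count in the doc comment above could arbitrate between two writers that collide on the same path. It is not the PR's implementation: `LockTable`, `TryLock`, `try_create` and `try_claim` are hypothetical names, and a `HashMap` stands in for the DynamoDB table. A real client would presumably express both checks as conditional writes, and the 60 second timeout is left to the caller here.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the DynamoDB table.
/// Maps the `"key"` (the object path) to its `"generation"`.
#[derive(Default)]
struct LockTable {
    records: HashMap<String, u64>,
}

/// Outcome of one attempt to take the lock.
#[derive(Debug, PartialEq)]
enum TryLock {
    /// The caller now holds the lock at this generation.
    Acquired(u64),
    /// Another writer holds the lock at this generation.
    Conflict(u64),
}

impl LockTable {
    /// Create the record only if none exists for `key`
    /// (modelled on a "put if absent" conditional write).
    fn try_create(&mut self, key: &str) -> TryLock {
        match self.records.get(key).copied() {
            Some(current) => TryLock::Conflict(current),
            None => {
                self.records.insert(key.to_string(), 0);
                TryLock::Acquired(0)
            }
        }
    }

    /// Claim a lock the caller believes is stale by incrementing the
    /// generation, but only if it still equals the value the caller
    /// observed. A writer holding an outdated observation loses.
    fn try_claim(&mut self, key: &str, observed: u64) -> TryLock {
        match self.records.get(key).copied() {
            Some(current) if current == observed => {
                self.records.insert(key.to_string(), current + 1);
                TryLock::Acquired(current + 1)
            }
            Some(current) => TryLock::Conflict(current),
            // The record disappeared (e.g. TTL expiry): fall back to create.
            None => self.try_create(key),
        }
    }
}
```

In this model, if two processes both see generation 0 as stale and both call `try_claim(path, 0)`, the first moves the record to generation 1 and the second gets a conflict, because its observation is out of date. Since the count only ever increases, a stale observation can never match again, which is the property a random UUID rvn would otherwise provide.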