[
https://issues.apache.org/jira/browse/ARROW-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597153#comment-17597153
]
Antoine Pitrou commented on ARROW-17544:
----------------------------------------
Two notes:
1) I'm lukewarm about adding new method definitions, as it will result in a
combinatorial explosion (we also need async versions of these APIs as well as
FileInfo-taking versions). One possibility would be to add the version as an
additional optional argument (e.g. {{const util::optional<std::string>& version
= {} }}).
2) GCS has both generation and metageneration, and I'm not sure how to handle
that nicely in the API. Should we just ignore the metageneration here? [~coryan]
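To make note 1 concrete, here is a minimal sketch of the optional-argument shape. It assumes C++17 {{std::optional}} in place of {{util::optional}}. {{MockVersionedFileSystem}}, {{Put}} and the "v1", "v2" version ids are hypothetical stand-ins for illustration, not Arrow APIs.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a versioned object store (not an Arrow class).
class MockVersionedFileSystem {
 public:
  // Stores a new version of `path` and returns its generated version id.
  std::string Put(const std::string& path, std::string content) {
    auto& versions = objects_[path];
    std::string id = "v" + std::to_string(versions.size() + 1);
    versions.emplace_back(id, std::move(content));
    return id;
  }

  // Single entry point: when `version` is omitted, the latest version is read.
  // Returns the object content directly to keep the sketch short.
  std::string OpenInputStream(
      const std::string& path,
      const std::optional<std::string>& version = std::nullopt) const {
    auto it = objects_.find(path);
    if (it == objects_.end() || it->second.empty()) {
      throw std::runtime_error("no such key: " + path);
    }
    const auto& versions = it->second;
    if (!version.has_value()) {
      return versions.back().second;  // latest version
    }
    for (const auto& [id, content] : versions) {
      if (id == *version) return content;
    }
    throw std::runtime_error("no such version: " + *version);
  }

 private:
  // path -> list of (version id, content), oldest first.
  std::map<std::string, std::vector<std::pair<std::string, std::string>>> objects_;
};
```

With this shape, {{fs.OpenInputStream(path)}} keeps working for existing callers and {{fs.OpenInputStream(path, id)}} selects a specific version. The async and FileInfo-taking variants would each gain the same defaulted parameter instead of a new sibling method.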
> [C++/Python] Add support for S3 Bucket Versioning
> -------------------------------------------------
>
> Key: ARROW-17544
> URL: https://issues.apache.org/jira/browse/ARROW-17544
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Affects Versions: 9.0.0
> Reporter: Rusty Conover
> Assignee: Rusty Conover
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Arrow offers a reasonably capable S3 interface, but it lacks support for S3
> Buckets that have versioning enabled. For information about what S3 bucket
> versioning is, see:
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]
> If Arrow is interacting with a bucket where versioning is enabled, there can
> be S3 keys that have multiple versions of content stored under the same
> key name. Currently, Arrow does not have the ability to:
> # Access a specific version of an S3 key rather than just the latest version.
> There is no way to specify the VersionId parameter of S3's
> GetObject API.
> # Report the VersionId created when a new S3 key is uploaded to a bucket.
> Along with S3, GCS also supports versioned buckets.
> [https://cloud.google.com/storage/docs/object-versioning]
> The FileSystem interface has a few shortcomings when it comes to remote
> file systems that support versioning:
> 1. The parameters for open_input_stream() and open_input_file() do not easily
> lend themselves to adding an additional parameter of "version" because they
> would be passed to all other implemented filesystems. Most other file
> systems that exist don't actually support versioning.
> 2. Upon completion of an S3 multipart upload (i.e., close() on an
> S3FileSystem output stream), there is not currently a way for the user to
> determine the VersionId or ETag of the S3 key that was created. This is
> important to know because if there are multiple concurrent writers to S3, it
> should be possible to identify the written S3 key.
> Proposed solutions to enable S3 Bucket versioning:
> 1. To allow library callers to read specific versions of an S3 key, extend
> only the S3FileSystem interface with two new API calls:
> {{open_input_stream_with_version()}}
> {{open_input_file_with_version()}}
> Both are like their namesakes from the normal FileSystem interface but take
> an additional parameter of a "version," which is a string representation of
> the VersionId returned by S3 when the S3 Key is created. If these functions
> are called with an empty string for the specified version, the latest version
> of the S3 key will be returned.
> I'm a bit reluctant to create these specialized functions just on the
> S3FileSystem interface, but I also don't think it is appropriate to change
> open_input_stream() and open_input_file()'s parameter list for all
> filesystems just for functionality that is only implemented by a small number
> of filesystems.
> 2. Allow callers to call ReadMetadata() on an S3FileSystem output stream to
> retrieve the metadata about the S3 key that has been written after the stream
> has been closed. The metadata will likely include both a VersionId and a
> value for ETag.
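Proposed solution 2 could look roughly like the sketch below. {{MockOutputStream}} and its fixed "example-version-id" and "example-etag" values are hypothetical placeholders, not Arrow or S3 behavior. The sketch only shows that the identifiers become readable once the stream has been closed.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an S3 output stream (not an Arrow class).
class MockOutputStream {
 public:
  // Buffers data; writing after Close() is an error.
  void Write(const std::string& data) {
    if (closed_) throw std::logic_error("stream is closed");
    buffer_ += data;
  }

  // Completing the upload is the point at which a store would report the
  // identifiers of the object that was just written.
  void Close() {
    if (closed_) return;
    closed_ = true;
    // Placeholder values; a real store would supply these in its response.
    metadata_["VersionId"] = "example-version-id";
    metadata_["ETag"] = "example-etag";
  }

  // In this sketch, metadata is only available after the stream is closed.
  std::map<std::string, std::string> ReadMetadata() const {
    if (!closed_) {
      throw std::logic_error("metadata is only available after Close()");
    }
    return metadata_;
  }

 private:
  bool closed_ = false;
  std::string buffer_;
  std::map<std::string, std::string> metadata_;
};
```

A writer competing with other concurrent writers could then call {{Close()}}, read the "VersionId" entry, and use it to identify exactly the object it created.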