Rusty Conover created ARROW-17544:
-------------------------------------
Summary: [C++/Python] Add support for S3 Bucket Versioning
Key: ARROW-17544
URL: https://issues.apache.org/jira/browse/ARROW-17544
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Affects Versions: 9.0.0
Reporter: Rusty Conover
Arrow offers a reasonably capable S3 interface, but it lacks support for S3
buckets that have versioning enabled. For background on S3 bucket versioning,
see:
[https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]
If Arrow is interacting with a bucket where versioning is enabled, a single S3
key can have multiple versions of content stored under the same key name.
Currently, Arrow cannot:
# Access a specific version of an S3 key rather than only the latest version.
There is no way to specify the VersionId parameter of S3's GetObject API.
# Report the VersionId created when a new S3 key is uploaded to a bucket.
Along with S3, GCS also supports versioned buckets.
[https://cloud.google.com/storage/docs/object-versioning]
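To make the two gaps above concrete, here is a toy in-memory model of a
versioned bucket. It is only a sketch of the semantics, not real S3 or Arrow
code: the class {{ToyVersionedBucket}} and its methods are hypothetical names,
and real VersionIds are opaque strings rather than the counters used here.

```python
import itertools
from collections import defaultdict


class ToyVersionedBucket:
    """In-memory stand-in for a bucket with versioning enabled (not real S3)."""

    def __init__(self):
        # key -> list of (version_id, body), oldest first
        self._versions = defaultdict(list)
        self._counter = itertools.count(1)

    def put_object(self, key, body):
        """Store a new version under `key` and return its version id."""
        version_id = f"v{next(self._counter)}"  # assumption: real ids are opaque
        self._versions[key].append((version_id, body))
        return version_id

    def get_object(self, key, version_id=None):
        """Return the latest body, or the body of a specific version."""
        versions = self._versions.get(key)
        if not versions:
            raise KeyError(key)
        if version_id is None:
            return versions[-1][1]
        for vid, body in versions:
            if vid == version_id:
                return body
        raise KeyError(f"{key}@{version_id}")


bucket = ToyVersionedBucket()
first = bucket.put_object("data.parquet", b"old")
second = bucket.put_object("data.parquet", b"new")

latest = bucket.get_object("data.parquet")                    # latest version
pinned = bucket.get_object("data.parquet", version_id=first)  # older version
```

A reader that can only ask for the latest version can never reach the first
body, and a writer that is not told the value returned by {{put_object}} cannot
later refer to the exact version it created.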
The FileSystem interface has two shortcomings when used with remote
filesystems that support versioning:
1. The parameters for open_input_stream() and open_input_file() do not easily
lend themselves to adding an additional parameter of "version" because it
would be passed to all other implemented filesystems. Most other filesystems
do not support versioning.
2. Upon completion of an S3 multipart upload (i.e., close() on an S3FileSystem
output stream), there is currently no way for the user to determine the
VersionId or ETag of the S3 key that was created. This matters because, when
there are multiple concurrent writers to S3, each writer should be able to
identify the S3 key version it wrote.
Proposed solutions to enable S3 Bucket versioning:
1. To allow library callers to read specific versions of an S3 key, extend only
the S3FileSystem interface with two new API calls:
{{open_input_stream_with_version()}}
{{open_input_file_with_version()}}
Both behave like their namesakes in the generic FileSystem interface but take
an additional "version" parameter, which is a string representation of the
VersionId returned by S3 when the S3 key is created. If these functions are
called with an empty string for the version, the latest version of the S3 key
will be returned.
I'm a bit reluctant to create these specialized functions just on the
S3FileSystem interface, but I also don't think it is appropriate to change
open_input_stream() and open_input_file()'s parameter list for all filesystems
just for functionality that is only implemented by a small number of
filesystems.
2. Allow callers to call ReadMetadata() on an S3FileSystem output stream to
retrieve the metadata about the S3 key that has been written after the stream
has been closed. The metadata will likely include both a VersionId and a value
for ETag.
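The shape of both proposals can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not Arrow's implementation: the
classes {{ToyS3FileSystem}} and {{ToyS3OutputStream}}, the {{_commit}} helper,
and the snake_case {{read_metadata()}} spelling are all hypothetical, the
storage is an in-memory dict, and the ETag is a placeholder digest rather than
what S3 actually computes.

```python
import hashlib
import io
import uuid


class FileSystem:
    """Generic interface: its signatures know nothing about versions."""

    def open_input_stream(self, path):
        raise NotImplementedError

    def open_output_stream(self, path):
        raise NotImplementedError


class ToyS3OutputStream:
    """Buffers writes; the object and its metadata only exist after close()."""

    def __init__(self, fs, path):
        self._fs = fs
        self._path = path
        self._buffer = bytearray()
        self._metadata = None

    def write(self, data):
        if self._metadata is not None:
            raise ValueError("stream is closed")
        self._buffer += data

    def close(self):
        if self._metadata is None:  # make close() idempotent
            self._metadata = self._fs._commit(self._path, bytes(self._buffer))

    def read_metadata(self):
        if self._metadata is None:
            raise ValueError("metadata is only available after close()")
        return dict(self._metadata)


class ToyS3FileSystem(FileSystem):
    """Adds the version-aware call on the S3-specific class only."""

    def __init__(self):
        self._objects = {}  # path -> list of (version_id, body), oldest first

    def _commit(self, path, body):
        version_id = uuid.uuid4().hex  # opaque id, standing in for a VersionId
        self._objects.setdefault(path, []).append((version_id, body))
        # Placeholder digest; a real S3 ETag is computed differently.
        return {"VersionId": version_id, "ETag": hashlib.sha256(body).hexdigest()}

    def open_output_stream(self, path):
        return ToyS3OutputStream(self, path)

    def open_input_stream(self, path):
        return self.open_input_stream_with_version(path, "")

    def open_input_stream_with_version(self, path, version):
        versions = self._objects[path]
        if version == "":  # empty string means "latest", as proposed
            return io.BytesIO(versions[-1][1])
        for vid, body in versions:
            if vid == version:
                return io.BytesIO(body)
        raise FileNotFoundError(f"{path} has no version {version!r}")


def write(fs, path, data):
    """Write `data` to `path` and return the metadata reported after close()."""
    out = fs.open_output_stream(path)
    out.write(data)
    out.close()
    return out.read_metadata()


fs = ToyS3FileSystem()
meta_a = write(fs, "bucket/key", b"from writer A")
meta_b = write(fs, "bucket/key", b"from writer B")

body_a = fs.open_input_stream_with_version("bucket/key", meta_a["VersionId"]).read()
body_latest = fs.open_input_stream_with_version("bucket/key", "").read()
```

Because the version-aware method lives only on the S3-specific class, the
generic {{open_input_stream()}} signature stays unchanged for every other
filesystem, and each writer can pin the exact version it produced by using the
VersionId reported after close().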
--
This message was sent by Atlassian Jira
(v8.20.10#820010)