[ 
https://issues.apache.org/jira/browse/ARROW-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rusty Conover updated ARROW-17544:
----------------------------------
    Description: 
Arrow offers a reasonably capable S3 interface, but it lacks support for S3 
buckets that have versioning enabled.  For information about S3 bucket 
versioning, see:

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]

When Arrow interacts with a bucket that has versioning enabled, a single S3 key 
can have multiple versions of its content stored under the same key name.  
Currently, Arrow cannot:
 # Access a specific version of an S3 key rather than just the latest one.  
There is no way to specify the VersionId parameter of S3's GetObject API.
 # Report the VersionId created when a new S3 key is uploaded to a bucket.

Along with S3, GCS also supports versioned buckets.

[https://cloud.google.com/storage/docs/object-versioning]

The FileSystem interface has a few shortcomings when it comes to remote 
filesystems that support versioning:

1. The signatures of open_input_stream() and open_input_file() do not easily 
lend themselves to an additional "version" parameter, because that parameter 
would have to be accepted by every other filesystem implementation.  Most 
existing filesystems do not support versioning.

2. Upon completion of an S3 multipart upload (i.e., close() on an S3FileSystem 
output stream), there is currently no way for the user to determine the 
VersionId or ETag of the S3 key that was created.  This matters because, with 
multiple concurrent writers to S3, each writer should be able to identify the 
object it wrote.

Proposed solutions to support S3 bucket versioning:

1. To allow library callers to read specific versions of an S3 key, extend only 
the S3FileSystem interface with two new API calls:

{{open_input_stream_with_version()}}

{{open_input_file_with_version()}}

Both behave like their namesakes in the regular FileSystem interface but take 
an additional "version" parameter, a string representation of the VersionId 
returned by S3 when the S3 key is created.  If either function is called with 
an empty string as the version, the latest version of the S3 key is returned.

I'm a bit reluctant to create these specialized functions just on the 
S3FileSystem interface, but I also don't think it is appropriate to change the 
parameter lists of open_input_stream() and open_input_file() for all 
filesystems just for functionality that only a small number of filesystems 
implement.
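
As a rough illustration of the read-side semantics described above, here is a 
minimal Python sketch.  It uses an in-memory stand-in for a versioned bucket 
rather than Arrow or S3 itself; the class name VersionedStoreSketch, the put() 
helper, and the generated version ids are assumptions made purely for the 
example.  Only the method name open_input_stream_with_version() and the 
"empty string means latest" rule come from the proposal.

```python
import io
import uuid


class VersionedStoreSketch:
    """In-memory stand-in for a versioned bucket (hypothetical, not Arrow API)."""

    def __init__(self):
        # Maps key -> list of (version_id, payload), oldest version first.
        self._objects = {}

    def put(self, key, payload):
        """Store a new version of `key` and return its generated version id."""
        version_id = uuid.uuid4().hex
        self._objects.setdefault(key, []).append((version_id, payload))
        return version_id

    def open_input_stream(self, key):
        """Unversioned call: always reads the latest version."""
        return self.open_input_stream_with_version(key, "")

    def open_input_stream_with_version(self, key, version):
        """Read a specific version; an empty string selects the latest one."""
        versions = self._objects.get(key)
        if not versions:
            raise FileNotFoundError(key)
        if version == "":
            return io.BytesIO(versions[-1][1])
        for version_id, payload in versions:
            if version_id == version:
                return io.BytesIO(payload)
        raise FileNotFoundError(f"{key} (version {version})")


store = VersionedStoreSketch()
v1 = store.put("bucket/data.parquet", b"first")
v2 = store.put("bucket/data.parquet", b"second")

latest = store.open_input_stream_with_version("bucket/data.parquet", "").read()
older = store.open_input_stream_with_version("bucket/data.parquet", v1).read()
```

In this sketch the unversioned call simply delegates with an empty version, so 
callers that do not care about versioning are unaffected.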

2. Allow callers to call ReadMetadata() on an S3FileSystem output stream, after 
the stream has been closed, to retrieve metadata about the S3 key that was 
written.  The metadata will likely include both a VersionId and an ETag.
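
The write-side idea can be sketched the same way.  The example below is a 
hypothetical in-memory output stream, not the Arrow S3 output stream: the class 
name OutputStreamSketch and the snake_case read_metadata() method are 
assumptions, and the VersionId and ETag values are placeholders generated 
locally (a real service would assign both).  It only shows the intended call 
order: write, close, then read the metadata of what was written.

```python
import hashlib
import io
import uuid


class OutputStreamSketch(io.BytesIO):
    """Hypothetical stand-in for an S3 output stream; not the Arrow class."""

    def __init__(self):
        super().__init__()
        self._metadata = None

    def close(self):
        # Capture metadata once, at the point where the upload would complete.
        if self._metadata is None:
            payload = self.getvalue()
            self._metadata = {
                # Placeholder values; real ones would come from the service.
                "VersionId": uuid.uuid4().hex,
                "ETag": hashlib.sha256(payload).hexdigest(),
            }
        super().close()

    def read_metadata(self):
        """Return metadata of the written object; valid only after close()."""
        if self._metadata is None:
            raise ValueError("metadata is only available after close()")
        return dict(self._metadata)


stream = OutputStreamSketch()
stream.write(b"payload")
stream.close()
info = stream.read_metadata()
```

With concurrent writers, each writer would keep the VersionId from its own 
stream and could later pass it to a versioned read call to fetch exactly the 
object it wrote.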

  was:
Arrow offers a reasonably capable S3 interface, but it lacks support for S3 
Buckets that have versioning enabled.  For information about what S3 bucket 
versioning is, see:

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]

If Arrow is interacting with a bucket where versioning is enabled, there can be 
S3 keys that have multiple versions of content stored utilizing the same key 
name.  At the present moment, Arrow does not have the ability to:
 # Access versions of an S3 key rather than just the latest version of an S3 
key.  There is no ability to specify the VersionId parameter of S3's GetObject 
API.
 # Report the VersionId created when a new S3 key is uploaded to a bucket.

Along with S3, GCS also supports versioned buckets.

[https://cloud.google.com/storage/docs/object-versioning]

There are a few shortcomings of the Filesystem interface to support remote file 
systems that support versioning:

1. The parameters for open_input_stream() and open_input_file() do not easily 
lend themselves to adding an additional parameter of "version" because they 
would be passed to all other implemented filesystems.  Most other file systems 
that exist don't actually support versioning.

2. Upon completion of an S3 multipart upload (i.e., close() on an S3FileSystem 
output stream), there is not currently a way for the user to determine the 
VersionId or ETag of the S3 key that was created.  This is important to know 
because if there are multiple concurrent writers to S3, it should be possible 
to identify the written S3 key.

Proposed solutions to enable S3 Bucket versioning:

1. To allow library callers to read specific versions of an S3 key, extend only 
the S3FileSystem interface with two new API calls:

{{open_input_stream_with_version()}}

{{open_input_file_with_version() }}{{}}

Both are like their namesakes from the normal FileSystem interface but take an 
additional parameter of a "version," which is a string representation of the 
VersionId returned by S3 when the S3 Key is created.  If these functions are 
called with an empty string for the specified version, the latest version of 
the S3 key will be returned.

I'm a bit reluctant to create these specialized functions just on the 
S3FileSystem interface, but I also don't think it is appropriate to change 
open_input_stream() and open_input_file()'s parameter list for all filesystems 
just for functionality that is only implemented by a small number of 
filesystems.

2. Allow callers to call ReadMetadata() on an S3FileSystem output stream to 
retrieve the metadata about the S3 key that has been written after the stream 
has been closed.  The metadata will likely include both a VersionId and a value 
for ETag.


> [C++/Python] Add support for S3 Bucket Versioning
> -------------------------------------------------
>
>                 Key: ARROW-17544
>                 URL: https://issues.apache.org/jira/browse/ARROW-17544
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>    Affects Versions: 9.0.0
>            Reporter: Rusty Conover
>            Priority: Major



