[ 
https://issues.apache.org/jira/browse/ARROW-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597176#comment-17597176
 ] 

Carlos O'Ryan commented on ARROW-17544:
---------------------------------------

> 2) GCS has both generation and metageneration, I'm not sure how to handle 
> that nicely in the API. Should we just ignore the metageneration here?

I think you should ignore it for now, yes.  metageneration changes when the 
metadata of an object changes, e.g., when you add new labels, or when you 
change the {{contentType}} attribute.  And FYI, if object versioning is 
enabled, then each version has its own metadata and metageneration.

> 1) I'm lukewarm about adding new method definitions, ... (e.g. `const 
> util::optional<std::string>& version = {}`).

It is not my place to design your API, so just ignore me if this idea does 
not work for you, but I would go further.  You can expect additional optional 
parameters.  For example, GCS supports conditional operations ("read this 
object if its current generation is N" or "write this object if its current 
generation is M"), and I believe S3 supports those too.  If the objects are 
encrypted with customer-supplied encryption keys, you need to provide the 
keys as part of each call.  Some applications want to override the retry and 
backoff policies for some operations.

This suggests you need something like a `std::map<..., std::any> options = {}` 
as the extra parameter.  This may serve as inspiration:

https://github.com/googleapis/google-cloud-cpp/blob/2b1e12f31ec49e55419f5740c406b707e08ca5e1/google/cloud/options.h#L89
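
To make the idea concrete, here is a minimal sketch of a type-indexed option 
bag passed as one extra defaulted parameter.  It assumes C++17 for `std::any` 
and `std::optional`.  Every name in it ({{OpenOptions}}, {{VersionOption}}, 
{{IfGenerationMatchOption}}, {{DescribeOpen}}) is hypothetical; none of this 
is Arrow's or google-cloud-cpp's actual API.

```cpp
#include <any>
#include <cstdint>
#include <optional>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>
#include <utility>

// Hypothetical option tags: each struct names one option and its value type.
struct VersionOption { using Type = std::string; };
struct IfGenerationMatchOption { using Type = std::int64_t; };

// Hypothetical option bag keyed by the tag type.  Filesystems that do not
// understand an option can simply never look it up.
class OpenOptions {
 public:
  // Store (or overwrite) the value for option `Tag`; returns *this to chain.
  template <typename Tag>
  OpenOptions& Set(typename Tag::Type value) {
    values_[std::type_index(typeid(Tag))] = std::move(value);
    return *this;
  }

  // Return the value for option `Tag`, or std::nullopt if it was never set.
  template <typename Tag>
  std::optional<typename Tag::Type> Get() const {
    auto it = values_.find(std::type_index(typeid(Tag)));
    if (it == values_.end()) return std::nullopt;
    return std::any_cast<typename Tag::Type>(it->second);
  }

 private:
  std::unordered_map<std::type_index, std::any> values_;
};

// Stand-in for a filesystem method: it only reports which options it saw.
std::string DescribeOpen(const std::string& path,
                         const OpenOptions& options = {}) {
  std::string out = "open " + path;
  if (auto v = options.Get<VersionOption>()) {
    out += " version=" + *v;
  }
  if (auto g = options.Get<IfGenerationMatchOption>()) {
    out += " if-generation-match=" + std::to_string(*g);
  }
  return out;
}
```

With this shape, a caller that does not care about versions keeps writing 
`DescribeOpen(path)`, while a caller that does can write 
`DescribeOpen(path, OpenOptions{}.Set<VersionOption>("abc123"))`.  Adding an 
option later (encryption keys, retry policies) would mean adding a tag 
struct, not changing the signature.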


> [C++/Python] Add support for S3 Bucket Versioning
> -------------------------------------------------
>
>                 Key: ARROW-17544
>                 URL: https://issues.apache.org/jira/browse/ARROW-17544
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>    Affects Versions: 9.0.0
>            Reporter: Rusty Conover
>            Assignee: Rusty Conover
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Arrow offers a reasonably capable S3 interface, but it lacks support for S3 
> Buckets that have versioning enabled.  For information about what S3 bucket 
> versioning is, see:
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]
> If Arrow is interacting with a bucket where versioning is enabled, there can 
> be S3 keys that have multiple versions of content stored utilizing the same 
> key name.  At the present moment, Arrow does not have the ability to:
>  # Access versions of an S3 key rather than just the latest version of an S3 
> key.  There is no ability to specify the VersionId parameter of S3's 
> GetObject API.
>  # Report the VersionId created when a new S3 key is uploaded to a bucket.
> Along with S3, GCS also supports versioned buckets.
> [https://cloud.google.com/storage/docs/object-versioning]
> There are a few shortcomings of the Filesystem interface to support remote 
> file systems that support versioning:
> 1. The parameters for open_input_stream() and open_input_file() do not easily 
> lend themselves to adding an additional parameter of "version" because they 
> would be passed to all other implemented filesystems.  Most other file 
> systems that exist don't actually support versioning.
> 2. Upon completion of an S3 multipart upload (i.e., close() on an 
> S3FileSystem output stream), there is not currently a way for the user to 
> determine the VersionId or ETag of the S3 key that was created.  This is 
> important to know because if there are multiple concurrent writers to S3, it 
> should be possible to identify the written S3 key.
> Proposed solutions to enable S3 Bucket versioning:
> 1. To allow library callers to read specific versions of an S3 key, extend 
> only the S3FileSystem interface with two new API calls:
> {{open_input_stream_with_version()}}
> {{open_input_file_with_version()}}
> Both are like their namesakes from the normal FileSystem interface but take 
> an additional parameter of a "version," which is a string representation of 
> the VersionId returned by S3 when the S3 Key is created.  If these functions 
> are called with an empty string for the specified version, the latest version 
> of the S3 key will be returned.
> I'm a bit reluctant to create these specialized functions just on the 
> S3FileSystem interface, but I also don't think it is appropriate to change 
> open_input_stream() and open_input_file()'s parameter list for all 
> filesystems just for functionality that is only implemented by a small number 
> of filesystems.
> 2. Allow callers to call ReadMetadata() on an S3FileSystem output stream to 
> retrieve the metadata about the S3 key that has been written after the stream 
> has been closed.  The metadata will likely include both a VersionId and a 
> value for ETag.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
