tustvold opened a new issue, #5458:
URL: https://github.com/apache/arrow-rs/issues/5458

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Currently streaming uploads are supported by `ObjectStore::put_multipart`. 
This returns an `AsyncWrite`, which provides a push-based interface for writing 
data.
   
   However, this approach is not without issue:
   
   * No obvious way to return PutResult for parts or the final object - 
https://github.com/apache/arrow-rs/issues/5443
   * No obvious way to retry uploading of a single part - 
https://github.com/apache/arrow-rs/issues/5437
   * Unclear poisoning behaviour - 
https://github.com/apache/arrow-rs/issues/5437
   * Cannot support resuming uploads - 
https://github.com/apache/arrow-rs/issues/4961 
https://github.com/apache/arrow-rs/issues/4608
   * No obvious way to support Attributes - 
https://github.com/apache/arrow-rs/issues/5435 #5334
   * AsyncWrite design can easily lead to timeouts - 
https://github.com/apache/arrow-rs/issues/5366 
https://github.com/tokio-rs/tokio/issues/4296
   * The way we implement poll_flush and poll_shutdown is not entirely in 
keeping with the AsyncWrite contract, e.g. poll_flush may not flush all 
buffered data
   * The ecosystem hasn't settled on a single IO trait for AsyncWrite (because 
they all have their own issues)
   * Data is copied potentially multiple times to/from buffers
   * Parallelism is controlled by the ObjectStore implementation internally 
with no way to control this
   
   **Describe the solution you'd like**
   
   #4971 added a `MultipartStore` abstraction that more closely mirrors the 
APIs exposed by object stores, avoiding all of the above issues. If we could 
devise a way to implement this interface for `LocalFileSystem` we could then 
"promote" it into the `ObjectStore` trait and deprecate `put_multipart`. This 
would provide the maximum flexibility to users, whilst being in keeping with 
the objectives of this crate to closely hew to the APIs of the stores 
themselves.
   
   The key observation that makes this possible is that we already recommend 
`MultipartStore` be used with fixed-size chunks for compatibility with R2. We 
could therefore require this for `LocalFileSystem`, in turn allowing it to 
support out-of-order / parallel writes, as the file offsets can be determined 
from the part index.
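   
   To illustrate the offset calculation, here is a minimal sketch, assuming 
every part except the last is exactly `part_size` bytes. The names 
`part_offset`, `write_part` and `assemble` are hypothetical helpers for 
illustration only, not part of `object_store`, and an in-memory `Cursor` stands 
in for the destination file.

```rust
use std::io::{self, Cursor, Seek, SeekFrom, Write};

/// Byte offset of a part, assuming every part except the last is
/// exactly `part_size` bytes long (hypothetical helper).
fn part_offset(part_idx: u64, part_size: u64) -> u64 {
    part_idx * part_size
}

/// Write a single part at the offset derived from its index. Because the
/// offset depends only on the index, parts can be written in any order.
fn write_part<W: Write + Seek>(
    dest: &mut W,
    part_idx: u64,
    part_size: u64,
    data: &[u8],
) -> io::Result<()> {
    dest.seek(SeekFrom::Start(part_offset(part_idx, part_size)))?;
    dest.write_all(data)
}

/// Assemble parts supplied in arbitrary order into one buffer.
/// A real implementation would write to a file rather than a `Cursor`.
fn assemble(parts: &[(u64, &[u8])], part_size: u64) -> io::Result<Vec<u8>> {
    let mut dest = Cursor::new(Vec::new());
    for &(idx, data) in parts {
        write_part(&mut dest, idx, part_size, data)?;
    }
    Ok(dest.into_inner())
}
```

   With a `part_size` of 3, writing parts 2, 0, 1 in that order still yields 
the bytes in index order, which is what would let a local filesystem 
implementation accept parallel part uploads without buffering them.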
   
   We could then add a `MultipartUpload` struct that provides a more idiomatic 
API for uploading chunks of data.
   
   https://github.com/apache/arrow-rs/pull/5431 and 
https://github.com/apache/arrow-rs/pull/4857 added `BufWriter` and `BufReader` 
and these would be retained to preserve compatibility with the tokio ecosystem.
   
   **Describe alternatives you've considered**
   
   I briefly considered a `put_stream` API; however, this doesn't resolve many 
of the above issues.
   
   We could also just implement `MultipartStore` for `LocalFileSystem`, whilst 
retaining the current `put_multipart`. This would allow downstreams to opt in 
to the lower-level API if they so wished.
   
   **Additional context**
   
   Many of the stores also support composing objects from others; this might be 
something to consider in this design - 
https://github.com/apache/arrow-rs/issues/4966
   
   FYI @wjones127 @Xuanwo @alamb @roeap 
   

