tustvold opened a new issue, #84:
URL: https://github.com/apache/arrow-rs-object-store/issues/84

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Currently, streaming uploads are supported by `ObjectStore::put_multipart`. 
This returns an `AsyncWrite`, which provides a push-based interface for writing 
data.
   
   However, this approach is not without issue:
   
   * No obvious way to return PutResult for parts or the final object - 
https://github.com/apache/arrow-rs/issues/5443
   * No obvious way to retry uploading of a single part - 
https://github.com/apache/arrow-rs/issues/5437
   * Unclear poisoning behaviour - 
https://github.com/apache/arrow-rs/issues/5437
   * Cannot support resuming uploads - 
https://github.com/apache/arrow-rs/issues/4961 
https://github.com/apache/arrow-rs/issues/4608
   * No obvious way to support Attributes - 
https://github.com/apache/arrow-rs/issues/5435 apache/arrow-rs#5334
   * AsyncWrite design can easily lead to timeouts - 
https://github.com/apache/arrow-rs/issues/5366 
https://github.com/tokio-rs/tokio/issues/4296
   * The way we implement poll_flush and poll_shutdown is not entirely in 
keeping with the AsyncWrite contract, e.g. poll_flush may not flush all 
buffered data
   * The ecosystem hasn't settled on a single IO trait for AsyncWrite (because 
they all have their own issues) - 
https://github.com/nrc/portable-interoperable/blob/master/io-traits/README.md
   * Data is copied potentially multiple times to/from buffers
   * Parallelism is controlled internally by the ObjectStore implementation, 
with no way for callers to configure it
   * AsyncWrite is tricky to integrate with synchronous code, despite the fact 
the internal buffering should make it straightforward
   * Cannot easily track upload progress - 
https://github.com/apache/arrow-rs/discussions/5117
   
   **Describe the solution you'd like**
   
   apache/arrow-rs#4971 added a `MultipartStore` abstraction that more closely 
mirrors the APIs exposed by object stores, avoiding all of the above issues. If 
we could devise a way to implement this interface for `LocalFileSystem` we 
could then "promote" it into the `ObjectStore` trait and deprecate 
put_multipart. This would provide the maximum flexibility to users, whilst 
being in keeping with the objectives of this crate to closely hew to the APIs 
of the stores themselves.
   
   The key observation that makes this possible is that we already recommend 
`MultipartStore` be used with fixed-size chunks for compatibility with R2. We 
could therefore require this for `LocalFileSystem`, which would in turn allow 
it to support out-of-order / parallel writes, as the file offsets can be 
determined from the part index.
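
   As a rough illustration of that observation, here is a minimal sketch 
using only the standard library, not the `object_store` API. The helper names 
`part_offset` and `write_part` are hypothetical, and the sketch assumes every 
part except the last is exactly `part_size` bytes, so a part's offset depends 
only on its index:

```rust
use std::fs::OpenOptions;
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

/// Byte offset of a part, assuming every part except the last is exactly
/// `part_size` bytes long. (Hypothetical helper, not part of object_store.)
fn part_offset(part_idx: usize, part_size: u64) -> u64 {
    part_idx as u64 * part_size
}

/// Write one part into `path` at the offset implied by its index.
///
/// Each call opens its own handle, so the seek position is not shared and
/// parts may be written in any order, or from different threads, as long as
/// their byte ranges do not overlap.
fn write_part(path: &Path, part_idx: usize, part_size: u64, data: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(false) // keep bytes already written by other parts
        .open(path)?;
    file.seek(SeekFrom::Start(part_offset(part_idx, part_size)))?;
    file.write_all(data)
}
```

   Writing part 2 before parts 0 and 1 simply extends the file past its 
current end; the earlier ranges are filled in when those parts arrive. A real 
implementation would presumably also need to write to a temporary file and 
rename it on completion, which this sketch does not attempt.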
   
   
   https://github.com/apache/arrow-rs/pull/5431 and 
https://github.com/apache/arrow-rs/pull/4857 added `BufWriter` and `BufReader`. 
These would be retained to preserve compatibility with the tokio ecosystem 
and to provide a more idiomatic API on top of the lower-level multipart interface.
   
   **Describe alternatives you've considered**
   
   I briefly considered a `put_stream` API; however, this doesn't resolve many 
of the above issues.
   
   We could also just implement `MultipartStore` for `LocalFileSystem`, whilst 
retaining the current `put_multipart`. This would allow downstreams to opt in 
to the lower-level API if they so wished.
   
   We could also modify `put_multipart` to return something other than 
`AsyncWrite`, possibly something closer to `PutPart`.
   
   **Additional context**
   
   Many of the stores also support composing objects from other objects, which 
might be something to consider in this design - 
https://github.com/apache/arrow-rs/issues/4966
   
   FYI @wjones127 @Xuanwo @alamb @roeap 
   


