tustvold opened a new issue, #5458: URL: https://github.com/apache/arrow-rs/issues/5458
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently streaming uploads are supported by `ObjectStore::put_multipart`. This returns an `AsyncWrite`, which provides a push-based interface for writing data.

However, this approach is not without issue:

* No obvious way to return `PutResult` for parts or the final object - https://github.com/apache/arrow-rs/issues/5443
* No obvious way to retry uploading of a single part - https://github.com/apache/arrow-rs/issues/5437
* Unclear poisoning behaviour - https://github.com/apache/arrow-rs/issues/5437
* Cannot support resuming uploads - https://github.com/apache/arrow-rs/issues/4961 https://github.com/apache/arrow-rs/issues/4608
* No obvious way to support `Attributes` - https://github.com/apache/arrow-rs/issues/5435 #5334
* `AsyncWrite` design can easily lead to timeouts - https://github.com/apache/arrow-rs/issues/5366 https://github.com/tokio-rs/tokio/issues/4296
* The way we implement `poll_flush` and `poll_shutdown` is not entirely in keeping with the `AsyncWrite` contract, e.g. `poll_flush` may not flush all buffered data
* The ecosystem hasn't settled on a single IO trait for `AsyncWrite` (because they all have their own issues)
* Data is potentially copied multiple times to/from buffers
* Parallelism is controlled internally by the `ObjectStore` implementation, with no way for the caller to control it

**Describe the solution you'd like**

#4971 added a `MultipartStore` abstraction that more closely mirrors the APIs exposed by object stores, avoiding all of the above issues.
If we could devise a way to implement this interface for `LocalFileSystem`, we could then "promote" it into the `ObjectStore` trait and deprecate `put_multipart`. This would provide the maximum flexibility to users, whilst being in keeping with the objectives of this crate to closely hew to the APIs of the stores themselves.

The key observation that makes this possible is that we already recommend `MultiPartStore` be used with fixed size chunks for compatibility with R2. We could therefore require this for `LocalFileSystem`, in turn allowing it to support out-of-order / parallel writes, as the file offsets can be determined from the part index.

We could then add a `MultipartUpload` struct that provides a more idiomatic API for uploading chunks of data.

https://github.com/apache/arrow-rs/pull/5431 and https://github.com/apache/arrow-rs/pull/4857 added `BufWriter` and `BufReader`, and these would be retained to preserve compatibility with the tokio ecosystem.

**Describe alternatives you've considered**

I briefly considered a `put_stream` API; however, this doesn't resolve many of the above issues.

We could also just implement `MultipartStore` for `LocalFileSystem`, whilst retaining the current `put_multipart`. This would allow downstreams to opt in to the lower-level API if they so wished.

**Additional context**

Many of the stores also support composing objects from others; this might be something to consider in this design - https://github.com/apache/arrow-rs/issues/4966

FYI @wjones127 @Xuanwo @alamb @roeap
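
To illustrate the fixed-size-chunk observation above, here is a minimal sketch using only the Rust standard library. It is not the `object_store` implementation or its API: `part_offset` and `write_part` are hypothetical helpers, and the sketch assumes every part except the last is exactly `part_size` bytes. Under that assumption the file offset of a part follows from its index alone, so parts can be written in any order.

```rust
use std::fs::OpenOptions;
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

/// Hypothetical helper: byte offset of part `part_idx`, assuming every part
/// except the last is exactly `part_size` bytes long.
fn part_offset(part_idx: u64, part_size: u64) -> u64 {
    part_idx * part_size
}

/// Hypothetical helper: write a single part at the offset implied by its
/// index. The file is created if missing and never truncated, so parts may
/// arrive out of order without clobbering each other.
fn write_part(path: &Path, part_idx: u64, part_size: u64, data: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(false)
        .open(path)?;
    // Seeking past the current end of the file is allowed; the gap is
    // filled in once the earlier parts are written.
    file.seek(SeekFrom::Start(part_offset(part_idx, part_size)))?;
    file.write_all(data)
}
```

Each call opens its own file handle, so independent parts could in principle be written from separate tasks. A real implementation would also need to handle completion, aborts, and validation of part sizes, which this sketch leaves out.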
