alamb opened a new issue, #23247: URL: https://github.com/apache/datafusion/issues/23247
### Is your feature request related to a problem or challenge? Noticed while reviewing this PR: - https://github.com/apache/datafusion/pull/21882#discussion_r3470735459 [`SpillFile`](https://github.com/apache/datafusion/blob/main/datafusion/execution/src/spill_file.rs) currently has an async-style API for reading spill data: ```rust fn read_stream(&self) -> Result<Pin<Box<dyn Stream<Item = Result<Bytes>> + Send>>>; ``` but writing spill data is synchronous: ```rust fn open_writer(&self) -> Result<Box<dyn SpillWriter>>; pub trait SpillWriter: std::io::Write + Send { fn finish(&mut self) -> Result<()>; } ``` This is somewhat awkward for spill backends that are naturally async, such as remote object stores. For example, when writing spill files to S3, GCS, Azure, or another [`object_store`](https://docs.rs/object_store/latest/object_store/) implementation, uploads are async operations, but the current `SpillWriter` API requires adapting them to a blocking `std::io::Write` interface. I found this while working on an example showing how to write spill files to an object store: - https://github.com/apache/datafusion/pull/23170 ### Describe the solution you'd like Add an async spill writing API so custom spill backends can write data without forcing async storage systems through a synchronous `Write` abstraction. For example, DataFusion could introduce an async writer trait along these lines: ```rust #[async_trait] pub trait AsyncSpillWriter: Send { async fn write_all(&mut self, bytes: Bytes) -> Result<()>; async fn finish(&mut self) -> Result<()>; } ``` or use an existing async I/O trait if there is a good fit. Suggested steps: - Add an async spill writer abstraction to `datafusion-execution` - Add an async open-writer method to `SpillFile` or introduce a new companion trait - Update spill writing paths to use the async API where possible - Keep compatibility for existing local-file spill implementations, likely by adapting local file writes to the async API - Update the object store spill example to use the new async API Acceptance criteria: - Remote/object-store spill implementations do not need to buffer all spill bytes in memory just to bridge from synchronous `Write` to async upload APIs - Existing local disk spilling continues to work - The public API change is documented in the upgrading guide if needed ### Describe alternatives you've considered Keep the current synchronous `SpillWriter` API. This works well for local files, but makes remote spill backends harder to implement efficiently because they must either block on async uploads or buffer data and upload it later. Add only an object-store-specific spill implementation. That would help one backend, but the underlying mismatch is in the spill file abstraction itself. ### Additional context Related PRs / comments: - Original review comment: https://github.com/apache/datafusion/pull/21882#discussion_r3470735459 - Object store spill example: https://github.com/apache/datafusion/pull/23170 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
