alamb opened a new issue, #23247:
URL: https://github.com/apache/datafusion/issues/23247

   ### Is your feature request related to a problem or challenge?
   
   Noticed while reviewing this PR:
   
   - https://github.com/apache/datafusion/pull/21882#discussion_r3470735459
   
   
[`SpillFile`](https://github.com/apache/datafusion/blob/main/datafusion/execution/src/spill_file.rs)
 currently has an async-style API for reading spill data:
   
   ```rust
   fn read_stream(&self) -> Result<Pin<Box<dyn Stream<Item = Result<Bytes>> + 
Send>>>;
   ```
   
   but writing spill data is synchronous:
   
   ```rust
   fn open_writer(&self) -> Result<Box<dyn SpillWriter>>;
   
   pub trait SpillWriter: std::io::Write + Send {
       fn finish(&mut self) -> Result<()>;
   }
   ```
   
   This is somewhat awkward for spill backends that are naturally async, such 
as remote object stores. For example, when writing spill files to S3, GCS, 
Azure, or another 
[`object_store`](https://docs.rs/object_store/latest/object_store/) 
implementation, uploads are async operations, but the current `SpillWriter` API 
requires adapting them to a blocking `std::io::Write` interface.
   
   I found this while working on an example showing how to write spill files to 
an object store:
   
   - https://github.com/apache/datafusion/pull/23170
   
   ### Describe the solution you'd like
   
   Add an async spill writing API so custom spill backends can write data 
without forcing async storage systems through a synchronous `Write` abstraction.
   
   For example, DataFusion could introduce an async writer trait along these 
lines:
   
   ```rust
   #[async_trait]
   pub trait AsyncSpillWriter: Send {
       async fn write_all(&mut self, bytes: Bytes) -> Result<()>;
       async fn finish(&mut self) -> Result<()>;
   }
   ```
   
   or use an existing async I/O trait if there is a good fit.
   
   Suggested steps:
   
   - Add an async spill writer abstraction to `datafusion-execution`
   - Add an async open-writer method to `SpillFile` or introduce a new 
companion trait
   - Update spill writing paths to use the async API where possible
   - Keep compatibility for existing local-file spill implementations, likely 
by adapting local file writes to the async API
   - Update the object store spill example to use the new async API
   
   Acceptance criteria:
   
   - Remote/object-store spill implementations do not need to buffer all spill 
bytes in memory just to bridge from synchronous `Write` to async upload APIs
   - Existing local disk spilling continues to work
   - The public API change is documented in the upgrading guide if needed
   
   ### Describe alternatives you've considered
   
   Keep the current synchronous `SpillWriter` API. This works well for local 
files, but makes remote spill backends harder to implement efficiently because 
they must either block on async uploads or buffer data and upload it later.
   
   Add only an object-store-specific spill implementation. That would help one 
backend, but the underlying mismatch is in the spill file abstraction itself.
   
   ### Additional context
   
   Related PRs / comments:
   
   - Original review comment: 
https://github.com/apache/datafusion/pull/21882#discussion_r3470735459
   - Object store spill example: https://github.com/apache/datafusion/pull/23170


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to