andygrove opened a new issue, #1387:
URL: https://github.com/apache/datafusion-ballista/issues/1387

   ## Description
   
   The shuffle writer currently uses synchronous file I/O operations 
(`std::fs`) inside async functions, which blocks the tokio runtime. This could 
impact performance, especially when multiple tasks are executing concurrently 
on the same executor.
   
   ## Current Implementation
   
   In `ballista/core/src/execution_plans/shuffle_writer.rs` and 
`ballista/core/src/utils.rs`, the following synchronous operations are used 
within async contexts:
   
   - `std::fs::create_dir_all()` - creating output directories
   - `std::fs::File::create()` - creating shuffle output files  
   - `std::fs::metadata()` - reading file size after write
   - `StreamWriter::write()` / `StreamWriter::finish()` - writing data 
(synchronous write to file)
   
   ## Proposed Optimization
   
   Consider using `tokio::fs` equivalents or wrapping synchronous I/O in 
`spawn_blocking`:
   
   ### Option 1: Use tokio::fs directly
   ```rust
   use tokio::fs::{self, File};
   use tokio::io::BufWriter;
   
   // Directory creation
   tokio::fs::create_dir_all(&path).await?;
   
   // File creation
   let file = BufWriter::new(File::create(path).await?);
   
   // File metadata
   let num_bytes = tokio::fs::metadata(&w.path).await?.len();
   ```
   
   ### Option 2: Use spawn_blocking for StreamWriter
   Since Arrow's `StreamWriter` is synchronous, the actual writes would need 
`spawn_blocking`:
   ```rust
   let batch_clone = batch.clone();
   tokio::task::spawn_blocking(move || {
       writer.write(&batch_clone)
   }).await??;
   ```
   
   ## Trade-offs to Consider
   
   1. **Complexity**: Async file I/O adds complexity, especially for the 
`StreamWriter` which is inherently synchronous
   2. **Performance**: The benefit depends on workload - may be more 
significant with many concurrent tasks and small batches
   3. **Memory**: `spawn_blocking` requires cloning data or careful lifetime 
management
   4. **Compatibility**: Arrow's IPC writer is synchronous; full async would 
require buffering or a different approach
   
   ## Benchmarking Needed
   
   Before implementing, it would be valuable to benchmark:
   - Current performance with synchronous I/O
   - Impact of blocking on concurrent task execution
   - Whether `spawn_blocking` overhead is worth the async benefit
   
   ## Related
   
   This was identified during a code review of the shuffle writer. See also PR 
#1386 which adds `BufWriter` for buffered I/O.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to