andygrove opened a new issue, #1387:
URL: https://github.com/apache/datafusion-ballista/issues/1387
## Description
The shuffle writer currently uses synchronous file I/O operations
(`std::fs`) inside async functions, which blocks the tokio runtime. This could
impact performance, especially when multiple tasks are executing concurrently
on the same executor.
## Current Implementation
In `ballista/core/src/execution_plans/shuffle_writer.rs` and
`ballista/core/src/utils.rs`, the following synchronous operations are used
within async contexts:
- `std::fs::create_dir_all()` - creating output directories
- `std::fs::File::create()` - creating shuffle output files
- `std::fs::metadata()` - reading file size after write
- `StreamWriter::write()` / `StreamWriter::finish()` - writing data
(synchronous write to file)
## Proposed Optimization
Consider using `tokio::fs` equivalents or wrapping synchronous I/O in
`spawn_blocking`:
### Option 1: Use tokio::fs directly
```rust
use tokio::fs::{self, File};
use tokio::io::BufWriter;
// Directory creation
tokio::fs::create_dir_all(&path).await?;
// File creation
let file = BufWriter::new(File::create(path).await?);
// File metadata
let num_bytes = tokio::fs::metadata(&w.path).await?.len();
```
### Option 2: Use spawn_blocking for StreamWriter
Since Arrow's `StreamWriter` is synchronous, the actual writes would need
`spawn_blocking`:
```rust
let batch_clone = batch.clone();
tokio::task::spawn_blocking(move || {
writer.write(&batch_clone)
}).await??;
```
## Trade-offs to Consider
1. **Complexity**: Async file I/O adds complexity, especially for the
`StreamWriter` which is inherently synchronous
2. **Performance**: The benefit depends on workload - may be more
significant with many concurrent tasks and small batches
3. **Memory**: `spawn_blocking` requires cloning data or careful lifetime
management
4. **Compatibility**: Arrow's IPC writer is synchronous; full async would
require buffering or a different approach
## Benchmarking Needed
Before implementing, it would be valuable to benchmark:
- Current performance with synchronous I/O
- Impact of blocking on concurrent task execution
- Whether `spawn_blocking` overhead is worth the async benefit
## Related
This was identified during a code review of the shuffle writer. See also PR
#1386 which adds `BufWriter` for buffered I/O.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]