andygrove opened a new pull request, #1386:
URL: https://github.com/apache/datafusion-ballista/pull/1386
## Summary
Two improvements to the shuffle writer:
1. **Add BufWriter for buffered file I/O**
   - Wrapping `File` with `BufWriter` reduces syscalls when writing multiple
     small batches to shuffle files
   - Applied to both the hash-partitioned shuffle path in
     `ShuffleWriterExec` and the `write_stream_to_disk` utility
2. **Fix file size being read before the writer finishes (bug fix)**
   - Previously, `fs::metadata()` was called before `writer.finish()`, which
     could report incorrect file sizes
   - At that point the data may not have been fully flushed to disk,
     especially now that `BufWriter` is used
   - Fixed by swapping the order: call `finish()` first, then read the file
     size
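
The ordering issue in item 2 can be reproduced with the standard library alone. The following is a minimal sketch, not the code from this PR: `sizes_around_flush` is a hypothetical helper, and a plain `flush()` on the `BufWriter` stands in for the shuffle writer's `finish()` call. It assumes the payload is smaller than the buffer capacity, so nothing reaches the file until the flush.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical helper: writes `payload` through a `BufWriter` and returns
/// the on-disk file size observed (before, after) the writer is flushed.
fn sizes_around_flush(path: &Path, payload: &[u8]) -> io::Result<(u64, u64)> {
    // `File::create` truncates, so the file starts out empty.
    let file = File::create(path)?;

    // Explicit 8 KiB capacity so that a small payload stays in the buffer.
    let mut writer = BufWriter::with_capacity(8 * 1024, file);
    writer.write_all(payload)?;

    // Old order: the size is read while the bytes are still buffered in memory.
    let before = fs::metadata(path)?.len();

    // Fixed order: flush first (stand-in for `finish()`), then read the size.
    writer.flush()?;
    let after = fs::metadata(path)?.len();

    Ok((before, after))
}
```

With a 100-byte payload, the size read before the flush does not include the buffered bytes, while the size read after it does. This is why the metadata lookup has to follow `finish()` once buffering is introduced.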
## Note
This PR was created with AI assistance using [Claude
Code](https://claude.ai/code). All changes were reviewed and approved by a
human maintainer.
## Test plan
- [x] Existing shuffle_writer tests pass
  (`cargo test -p ballista-core shuffle_writer`)
- [ ] Manual testing with distributed queries to verify shuffle performance