ding-young opened a new issue, #17595: URL: https://github.com/apache/datafusion/issues/17595
### Is your feature request related to a problem or challenge?

Currently, when spilling `RecordBatch`es to disk, DataFusion serializes them using Arrow IPC in units defined by the global `batch_size` configuration (number of rows). However, it may be beneficial to decouple the spill batch size from the execution batch size. While large batches are good for vectorized execution, they can cause issues when reading back during multi-level merge (e.g. query failures even with fewer streams).

### Describe the solution you'd like

1. Add a `spill_batch_size` configuration option (in rows, or possibly in bytes).
2. Benchmark and validate its effect on I/O throughput and query stability (failures in multi-level merge).

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
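
To make the row-based variant of the proposal concrete, here is a minimal sketch of how an execution-sized batch might be cut into smaller spill-sized pieces. It is not DataFusion code: `spill_chunk_ranges` is a hypothetical helper, `spill_batch_size` is the option proposed above (it does not exist yet), and the sketch only computes `(offset, length)` row ranges. In a real implementation each range could presumably be handed to something like arrow's `RecordBatch::slice(offset, length)` before IPC serialization.

```rust
/// Hypothetical helper (not part of DataFusion): split a batch of `num_rows`
/// rows into consecutive `(offset, length)` ranges, each covering at most
/// `spill_batch_size` rows. The last range may be shorter than the rest.
fn spill_chunk_ranges(num_rows: usize, spill_batch_size: usize) -> Vec<(usize, usize)> {
    // A zero chunk size would never make progress, so reject it up front.
    assert!(spill_batch_size > 0, "spill_batch_size must be non-zero");

    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        // Take a full chunk, or whatever rows remain at the tail.
        let len = spill_batch_size.min(num_rows - offset);
        ranges.push((offset, len));
        offset += len;
    }
    ranges
}
```

For example, `spill_chunk_ranges(8192, 1024)` yields eight ranges of 1024 rows each, so a reader in a multi-level merge would only need to hold 1024-row batches per stream rather than 8192-row ones. A byte-based limit would need per-batch size estimates instead of a fixed row count, which this sketch does not attempt.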