[I] Add separate `spill_batch_size` configuration [datafusion]

via GitHub Tue, 16 Sep 2025 06:28:51 -0700


ding-young opened a new issue, #17595:
URL: https://github.com/apache/datafusion/issues/17595


   ### Is your feature request related to a problem or challenge?
   
   Currently, when spilling `RecordBatch`es to disk, DataFusion serializes them 
with Arrow IPC in units defined by the global `batch_size` configuration 
(a number of rows). It may be beneficial to decouple the spill batch size from 
the execution batch size. Large batches are good for vectorized execution, but 
they can cause problems when they are read back during a multi-level merge 
(e.g. query failures even with fewer streams).
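
To make the concern concrete, here is a rough back-of-the-envelope sketch. It assumes, as a simplification that the issue itself does not state, that a merge keeps at least one deserialized batch in memory per spill stream. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def merge_read_memory(num_streams: int, rows_per_batch: int, bytes_per_row: int) -> int:
    """Rough estimate of bytes held while merging.

    Assumption (not from the issue): one in-flight batch per spill stream.
    """
    return num_streams * rows_per_batch * bytes_per_row


# Hypothetical workload: 4 spill streams, 1 KiB per row.
large = merge_read_memory(num_streams=4, rows_per_batch=8192, bytes_per_row=1024)
small = merge_read_memory(num_streams=4, rows_per_batch=1024, bytes_per_row=1024)

print(large // (1024 * 1024), "MiB with execution-sized batches")
print(small // (1024 * 1024), "MiB with smaller spill batches")
```

Under this simplified model, the memory needed to read spilled data back scales with the spill batch size, so reducing the number of streams alone may not be enough when each batch is large.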
   
   ### Describe the solution you'd like
   
   1. Add a `spill_batch_size` configuration option (whether its unit is rows 
or bytes is still open). 
   2. Benchmark and validate its effect on I/O throughput and query stability 
(failures in multi-level merge).
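
As a minimal sketch of what step 1 could mean in practice, the snippet below splits an execution-sized batch into smaller row ranges before spilling. It is not DataFusion code: `spill_chunks`, `spill_batch_rows`, `spill_batch_bytes`, and `bytes_per_row` are hypothetical names, and the byte-based variant assumes a known average row width. Each `(offset, length)` pair has the shape a zero-copy slice of a record batch would take.

```python
from typing import Iterator, Optional, Tuple


def spill_chunks(
    num_rows: int,
    spill_batch_rows: Optional[int] = None,
    spill_batch_bytes: Optional[int] = None,
    bytes_per_row: Optional[int] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) row ranges that cover a batch of num_rows rows.

    Exactly one of spill_batch_rows or spill_batch_bytes must be given.
    All parameter names are hypothetical, for illustration only.
    """
    if (spill_batch_rows is None) == (spill_batch_bytes is None):
        raise ValueError("set exactly one of spill_batch_rows or spill_batch_bytes")
    if spill_batch_bytes is not None:
        if not bytes_per_row or bytes_per_row <= 0:
            raise ValueError("bytes_per_row must be positive for a byte-based limit")
        # Convert the byte budget to rows, always writing at least one row.
        rows = max(1, spill_batch_bytes // bytes_per_row)
    else:
        rows = spill_batch_rows
    if rows <= 0:
        raise ValueError("spill batch size must be positive")

    offset = 0
    while offset < num_rows:
        length = min(rows, num_rows - offset)
        yield offset, length
        offset += length


# Row-based limit: an 8192-row batch spilled in chunks of at most 3000 rows.
by_rows = list(spill_chunks(8192, spill_batch_rows=3000))

# Byte-based limit: 64-byte budget with 16-byte rows gives 4 rows per chunk.
by_bytes = list(spill_chunks(10, spill_batch_bytes=64, bytes_per_row=16))
```

A row-based unit is simpler and matches the existing `batch_size` semantics, while a byte-based unit bounds memory more directly when row widths vary; the benchmark in step 2 would be the place to compare them.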
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

