andygrove opened a new pull request, #3941:
URL: https://github.com/apache/datafusion-comet/pull/3941

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   The buffered shuffle writer (`MultiPartitionShuffleRepartitioner`) 
accumulates all input batches in memory before writing partitioned output 
during `shuffle_write`. When given ample memory (or no memory limit), it 
buffers everything — which paradoxically *degrades* throughput because 
`interleave_record_batch` has poor cache locality when working over a huge 
buffer of batches.
   
   Benchmarks show throughput drops as more memory is available:
   
   | Memory Limit | Throughput | Peak RSS |
   |---|---|---|
   | 512MB | 2.33 M/s | 2.2 GiB |
   | 1GB | 2.33 M/s | 2.8 GiB |
   | 4GB | 2.22 M/s | 7.3 GiB |
   | unlimited | 1.40 M/s | 22.6 GiB |
   
   ## What changes are included in this PR?
   
   A new config `spark.comet.exec.shuffle.batchSpillLimit` (default: 100) that 
triggers spilling after a fixed number of buffered input batches, regardless of 
memory availability. This keeps the working set bounded and maintains good 
cache locality during the interleave phase.
   
   The change is minimal — a single condition added to 
`buffer_partitioned_batch_may_spill`:
   
   ```rust
   let over_batch_limit = self.batch_spill_limit > 0
       && self.buffered_batches.len() >= self.batch_spill_limit;
   if over_batch_limit || self.reservation.try_grow(mem_growth).is_err() {
       self.spill()?;
   }
   ```
   
   Setting `batchSpillLimit=0` disables the threshold, preserving the existing 
behavior.
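
The decision logic can be modeled as a small standalone predicate. This is a sketch for illustration only, not the actual Comet implementation: `should_spill` and its parameters are hypothetical names mirroring the fields used in `buffer_partitioned_batch_may_spill`, with the memory reservation reduced to a boolean.

```rust
// Hypothetical model of the spill trigger described above.
// `batch_spill_limit` and `buffered` mirror the repartitioner fields;
// `reservation_ok` stands in for `reservation.try_grow(..).is_ok()`.
fn should_spill(batch_spill_limit: usize, buffered: usize, reservation_ok: bool) -> bool {
    // A limit of 0 disables the batch-count threshold entirely.
    let over_batch_limit = batch_spill_limit > 0 && buffered >= batch_spill_limit;
    // Spill when either the batch count is exceeded or memory cannot grow.
    over_batch_limit || !reservation_ok
}

fn main() {
    // Default limit of 100: the 100th buffered batch triggers a spill
    // even when memory is plentiful.
    assert!(should_spill(100, 100, true));
    assert!(!should_spill(100, 99, true));
    // batchSpillLimit=0 preserves the existing memory-only behavior.
    assert!(!should_spill(0, 10_000, true));
    assert!(should_spill(0, 10_000, false));
}
```

Under this model, memory pressure still forces a spill regardless of the configured limit, so constrained-memory behavior is unchanged.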
   
   **Benchmark results (200 partitions, 100M rows, unlimited memory):**
   
   | Config | Throughput | Peak RSS | Spills |
   |--------|-----------|----------|--------|
   | No limit (current) | 1.38 M/s | 22.6 GiB | 0 |
   | batchSpillLimit=100 | **2.43 M/s** | **1.7 GiB** | 123 |
   
   **+76% throughput, -93% memory usage.** No regression when memory is already 
constrained.
   
   ## How are these changes tested?
   
   - All 19 existing shuffle tests pass (the tests set `batch_spill_limit=0`, which preserves the existing behavior)
   - Benchmarked with TPC-H SF100 lineitem data across multiple configurations


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

