kosiew opened a new pull request, #16734:
URL: https://github.com/apache/datafusion/pull/16734

   ## Which issue does this PR close?
   
   - Closes #16717.
   
   ## Rationale for this change
   
   Large `RecordBatch`es produced by data sources can degrade performance and 
limit parallelism in downstream operators. To address this, this PR adds a 
configurable mechanism that automatically splits oversized batches into smaller 
chunks, so downstream operators receive finer-grained units of work. This is 
particularly helpful for high-throughput ingestion scenarios and for sources 
that emit unevenly sized batches.
   
   ## What changes are included in this PR?
   
   - Introduced a new `batch_split_threshold` configuration under 
`SessionConfig::execution`, with a default value of `8192`.
   - Implemented a new `BatchSplitStream` wrapper that splits large 
`RecordBatch`es into smaller ones based on the configured batch size.
   - Integrated `BatchSplitStream` into `DataSourceExec` when the 
`batch_split_threshold` is enabled.
   - Added `SplitMetrics` for tracking the number of times batches are split.
   - Added extensive unit tests covering split behavior, edge cases (empty 
batches, exact size matches), and metric recording.
   - Created a new test file, `datasource_split.rs`, and registered it in 
`mod.rs`.
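
To illustrate the splitting rule described above, here is a minimal std-only sketch. It is not the PR's `BatchSplitStream` code: `split_ranges` is a hypothetical helper that only computes the `(offset, length)` pairs one might pass to a slicing call such as Arrow's `RecordBatch::slice`. It assumes that a threshold of `0` disables splitting and that a batch at or below the threshold (including an empty one) passes through as a single range.

```rust
/// Hypothetical helper (not part of the PR): returns the `(offset, length)`
/// ranges that a batch of `num_rows` rows would be cut into so that no chunk
/// exceeds `threshold` rows.
///
/// Assumption: `threshold == 0` means "splitting disabled".
fn split_ranges(num_rows: usize, threshold: usize) -> Vec<(usize, usize)> {
    // Disabled, or already small enough: keep the batch as one range.
    if threshold == 0 || num_rows <= threshold {
        return vec![(0, num_rows)];
    }

    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        // The last chunk may be shorter than the threshold.
        let len = threshold.min(num_rows - offset);
        ranges.push((offset, len));
        offset += len;
    }
    ranges
}
```

Under these assumptions, a 20,000-row batch with the default threshold of `8192` yields two full chunks followed by a 3,616-row remainder, while a batch of exactly 8,192 rows is left as is.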
   
   ## Are these changes tested?
   
   Yes. The PR includes comprehensive unit tests that cover:
   
   - Automatic splitting of large batches
   - No splitting when the batch size is below the threshold
   - Edge cases such as zero-row batches and multiple empty batches
   - Metrics correctness and slicing integrity
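
The properties listed above can be sketched with a small std-only model. This is not the PR's test code: plain `Vec<i32>` values stand in for `RecordBatch`es, and `SplitStats` and `split_batches` are hypothetical names. The sketch assumes the metric counts input batches that were cut into more than one chunk, which is one possible reading of "number of times batches are split".

```rust
/// Hypothetical counter standing in for the PR's `SplitMetrics`.
#[derive(Debug, Default, PartialEq)]
struct SplitStats {
    /// Number of input batches that were cut into more than one chunk
    /// (an assumed interpretation of the metric).
    batches_split: usize,
}

/// Splits every batch longer than `threshold` rows into chunks of at most
/// `threshold` rows. A threshold of 0 leaves all batches untouched.
fn split_batches(batches: &[Vec<i32>], threshold: usize) -> (Vec<Vec<i32>>, SplitStats) {
    let mut out = Vec::new();
    let mut stats = SplitStats::default();
    for batch in batches {
        if threshold == 0 || batch.len() <= threshold {
            // Small and empty batches pass through unchanged.
            out.push(batch.clone());
        } else {
            stats.batches_split += 1;
            // `chunks` preserves row order, so concatenating the output
            // reproduces the input exactly.
            out.extend(batch.chunks(threshold).map(|chunk| chunk.to_vec()));
        }
    }
    (out, stats)
}
```

A check in this style would assert that every output chunk has at most `threshold` rows, that the concatenated output equals the concatenated input, and that the counter matches the number of oversized inputs.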
   
   ## Are there any user-facing changes?
   
   Yes.
   
   - New configuration option: `batch_split_threshold`, accessible via 
`SessionConfig::with_batch_split_threshold`.
   - Users can disable this behavior entirely by setting the threshold to `0`.
   - Split behavior is internal and does not affect the logical output or 
schema visible to end users.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

