kosiew opened a new pull request, #16734: URL: https://github.com/apache/datafusion/pull/16734
## Which issue does this PR close?

- Closes #16717.

## Rationale for this change

Large `RecordBatch`es produced by data sources can degrade performance and limit parallelism in downstream operators. To address this, we introduce a configurable mechanism to automatically split oversized batches into smaller chunks, improving processing granularity and system responsiveness. This feature is particularly helpful for high-throughput ingestion scenarios or when working with uneven batch sizes from sources.

## What changes are included in this PR?

- Introduced a new `batch_split_threshold` configuration under `SessionConfig::execution`, with a default value of `8192`.
- Implemented a new `BatchSplitStream` wrapper that splits large `RecordBatch`es into smaller ones based on the configured batch size.
- Integrated `BatchSplitStream` into `DataSourceExec` when the `batch_split_threshold` is enabled.
- Added `SplitMetrics` for tracking the number of times batches are split.
- Added extensive unit tests covering split behavior, edge cases (empty batches, exact size matches), and metric recording.
- Created a new test file: `datasource_split.rs` and registered it in `mod.rs`.

## Are these changes tested?

Yes. The PR includes comprehensive unit tests that cover:

- Automatic splitting of large batches
- No splitting when the batch size is below the threshold
- Edge cases like zero-row and multiple empty batches
- Metrics correctness and slicing integrity

## Are there any user-facing changes?

Yes.

- New configuration option: `batch_split_threshold`, accessible via `SessionConfig::with_batch_split_threshold`.
- Users can disable this behavior entirely by setting the threshold to `0`.
- Split behavior is internal and does not affect the logical output or schema visible to end users.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
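For readers following the splitting behavior described in the PR, here is a minimal, standard-library-only sketch of how a batch's row count might be carved into slices. It is not the PR's `BatchSplitStream`. The function name `split_ranges` is hypothetical, and the handling of empty batches (passed through as a single zero-length range) is an assumption made for illustration.

```rust
/// Hypothetical helper (not part of the PR): compute the `(offset, length)`
/// slices that cover a batch of `num_rows` rows, each at most `threshold` rows.
///
/// Assumptions made for this sketch:
/// - a `threshold` of 0 disables splitting, as the PR description states;
/// - a batch at or below the threshold, including an empty batch, is passed
///   through as a single range.
fn split_ranges(num_rows: usize, threshold: usize) -> Vec<(usize, usize)> {
    // Splitting disabled, or the batch is already small enough.
    if threshold == 0 || num_rows <= threshold {
        return vec![(0, num_rows)];
    }

    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        // The final chunk may be shorter than the threshold.
        let len = threshold.min(num_rows - offset);
        ranges.push((offset, len));
        offset += len;
    }
    ranges
}
```

With the default threshold of `8192`, a 20,000-row batch yields three ranges in this sketch, the last one shorter than the others. Each `(offset, length)` pair could then be handed to Arrow's `RecordBatch::slice(offset, length)`, which returns a zero-copy view of the rows.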
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org