alihan-synnada opened a new pull request, #12969: URL: https://github.com/apache/datafusion/pull/12969
## Which issue does this PR close?

Closes #12633

## Rationale for this change

A join operation chain can create a RecordBatch whose size is thousands or even millions of rows.

## What changes are included in this PR?

Adds a new config option called `enforce_batch_size_in_joins` that is disabled by default. Enabling it restricts the maximum output batch size of join operators to `batch_size`.

#12634 is similar, but it splits the join indices and then builds the output batches, which causes performance issues. This PR splits the output batches after the join is processed.

Improves `adjust_indices_by_join_type` performance by optimizing `PrimitiveArray` concatenation using `MutableArrayData`.

## Are these changes tested?

Includes unit tests for `BatchSplitter`.

## Are there any user-facing changes?

Users can optionally enable `enforce_batch_size_in_joins` in cases where joins cause out-of-memory errors. No breaking changes.
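To illustrate the idea of splitting an oversized join output after it has been produced, here is a minimal, std-only sketch. It is not the PR's `BatchSplitter`: the `Batch` and `SimpleBatchSplitter` types are hypothetical stand-ins, and a plain `Vec<i64>` takes the place of an Arrow `RecordBatch`. It only shows the chunking behavior, where each emitted piece holds at most `batch_size` rows.

```rust
/// Hypothetical stand-in for a join output batch: a single column of rows.
/// (An assumption for illustration; the real operators emit Arrow `RecordBatch`es.)
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    rows: Vec<i64>,
}

/// Hypothetical splitter that yields pieces of at most `batch_size` rows
/// from an already-built output batch.
struct SimpleBatchSplitter {
    batch: Batch,
    batch_size: usize,
    offset: usize,
}

impl SimpleBatchSplitter {
    fn new(batch: Batch, batch_size: usize) -> Self {
        // A zero batch size would never make progress, so reject it up front.
        assert!(batch_size > 0, "batch_size must be positive");
        Self { batch, batch_size, offset: 0 }
    }
}

impl Iterator for SimpleBatchSplitter {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        let total = self.batch.rows.len();
        if self.offset >= total {
            return None; // every row has been emitted
        }
        // Take up to `batch_size` rows starting at the current offset.
        let end = (self.offset + self.batch_size).min(total);
        let piece = Batch {
            rows: self.batch.rows[self.offset..end].to_vec(),
        };
        self.offset = end;
        Some(piece)
    }
}
```

With a 10-row batch and a `batch_size` of 4, this sketch yields pieces of 4, 4, and 2 rows. The sketch copies each piece for simplicity; an Arrow-based version would more likely slice the batch without copying.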
