alihan-synnada opened a new pull request, #12969:
URL: https://github.com/apache/datafusion/pull/12969

   ## Which issue does this PR close?
   
   Closes #12633
   
   ## Rationale for this change
   
   A chain of join operations can produce a single RecordBatch containing 
thousands or even millions of rows, far more than the configured `batch_size`.
   
   ## What changes are included in this PR?
   
   Adds a new config option, `enforce_batch_size_in_joins`, which is disabled 
by default. Enabling it caps the output batch size of join operators at 
`batch_size`. #12634 is similar, but it splits the join indices first and then 
builds the output batches, which causes performance issues. This PR instead 
splits the output batches after the join has been processed.
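To illustrate the "split after the join" approach, here is a minimal sketch using only the standard library. It is not the `BatchSplitter` from this PR: `RowSplitter` is a hypothetical stand-in that works on a plain `Vec` of rows instead of an Arrow `RecordBatch`, and its `batch_size` field only mirrors the config value of the same name.

```rust
/// Hypothetical stand-in for a batch splitter (not the PR's `BatchSplitter`).
/// It holds one oversized "join output" and hands it back in chunks of at
/// most `batch_size` rows.
struct RowSplitter<T> {
    rows: Vec<T>,
    offset: usize,     // index of the first row not yet emitted
    batch_size: usize, // maximum number of rows per emitted chunk
}

impl<T: Clone> RowSplitter<T> {
    fn new(rows: Vec<T>, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { rows, offset: 0, batch_size }
    }
}

impl<T: Clone> Iterator for RowSplitter<T> {
    type Item = Vec<T>;

    fn next(&mut self) -> Option<Vec<T>> {
        // Nothing left to emit once every row has been handed out.
        if self.offset >= self.rows.len() {
            return None;
        }
        // Take up to `batch_size` rows; the last chunk may be shorter.
        let end = (self.offset + self.batch_size).min(self.rows.len());
        let chunk = self.rows[self.offset..end].to_vec();
        self.offset = end;
        Some(chunk)
    }
}
```

Under these assumptions, iterating over `RowSplitter::new(rows, 4)` with ten input rows would yield chunks of four, four, and two rows. The join itself runs once over the full input, and only its finished output is cut into pieces.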
   
   Improves `adjust_indices_by_join_type` performance by optimizing 
`PrimitiveArray` concatenation using `MutableArrayData`.
   
   ## Are these changes tested?
   
   Includes unit tests for `BatchSplitter`.
   
   ## Are there any user-facing changes?
   
   Users can optionally enable `enforce_batch_size_in_joins` in cases where 
joins cause out-of-memory errors. There are no breaking changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

