Re: [PR] feat: support push batch direct to completed as a public API and add biggest coalesce batch support [arrow-rs]

via GitHub Sat, 16 Aug 2025 08:02:53 -0700


zhuqi-lucas commented on code in PR #8146:
URL: https://github.com/apache/arrow-rs/pull/8146#discussion_r2280462502



##########
arrow-select/src/coalesce.rs:
##########
@@ -236,6 +262,13 @@ impl BatchCoalescer {
     /// assert_eq!(completed_batch, expected_batch);
     /// ```
     pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError> 
{
+        if let Some(limit) = self.biggest_coalesce_batch_size {

Review Comment:
   > Also, thinking about it, we can as well prevent flushing in progress array 
if we do not require the output to be sorted (so smaller in progress batches 
will still be combined into a output batch of the target size).
   
   Great point @Dandandan  , i agree this is a todo, i was also thinking, even 
we skip coalesce large batch, we will still output smaller batches before the 
large batch. If we don't need sort, we can combine all smaller batches, but the 
logic may need more investigation(how to define and get sort flag in low level, 
etc).
   
   > I mean if there are already smaller batches in the in progress array and 
the condition holds, the array will be created and in progress batches will be 
copied, creating two batches.
   It would be worth testing I think only having the short path if the 
condition holds and there are no (smaller) in progress batches, so there is 
only one output batch being generated.
   
   Interesting, so if i make sense right, there are two cases for it, for 
example we target batch size is 1000, input batches:
   
   1. 20, 20, 700, 20, 30, 700, 30, 20, 10, 15
   2. 20, 20, 30, 700, 600, 700, 900, 700, 600
   
   
   So we can only apply short path for continue large batch, the case 2, 
because when we meet continue large batches, we will not have progress batches, 
so we can output directly.
   
   But for case 1, we will not apply the short path, we will keep to coalesce 
it, because before each large batch, we already have small batches in progress, 
so we don't  apply short path.
   
   Am i right? I am not sure the performance result, i need to testing with 
datafusion, thanks!
   
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: support push batch direct to completed as a public API and add biggest coalesce batch support [arrow-rs]

Reply via email to