Re: [PR] chore: fix native shuffle for thin batches [datafusion-comet]

via GitHub Wed, 01 Apr 2026 10:25:12 -0700


mbutrovich commented on PR #3858:
URL: https://github.com/apache/datafusion-comet/pull/3858#issuecomment-4171727830


   > > Maybe that's a premature optimization, but it seems a bit silly to me if
   > > we could end up writing a bunch of empty IPC batches.
   >
   > IMO we filter them out inside the shuffle writer but before IPC, but this is
   > a valid point; perhaps we can move this check up earlier, checking this
   
   You could detect `schema.fields().is_empty()` once at the start of
`partitioning_batch()` and just accumulate `partition_row_counts: Vec<usize>`
per partition. Then `shuffle_write` emits one `RecordBatch` per partition with
the summed row count. That would take O(1) memory and work rather than O(N) in
the number of rows.
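
   A minimal, std-only sketch of that idea, not the actual Comet code:
`EmptyBatch`, `RowCountAccumulator`, `add`, and `finish` are hypothetical names
used only for illustration, and skipping partitions that received no rows is an
assumption on my part. In the real writer the emitted value would be a
zero-column `RecordBatch`, which arrow-rs can build with
`RecordBatch::try_new_with_options` and
`RecordBatchOptions::new().with_row_count(Some(n))`.

```rust
/// Hypothetical stand-in for a zero-column RecordBatch: only the row count matters.
#[derive(Debug, PartialEq, Eq)]
struct EmptyBatch {
    num_rows: usize,
}

/// Keeps one counter per output partition instead of buffering any rows.
struct RowCountAccumulator {
    partition_row_counts: Vec<usize>,
}

impl RowCountAccumulator {
    fn new(num_partitions: usize) -> Self {
        Self {
            partition_row_counts: vec![0; num_partitions],
        }
    }

    /// Record that `rows` rows of a zero-column batch were routed to `partition`.
    fn add(&mut self, partition: usize, rows: usize) {
        self.partition_row_counts[partition] += rows;
    }

    /// Emit at most one batch per partition, carrying the summed row count.
    /// Assumption: partitions that received no rows are skipped entirely.
    fn finish(&self) -> Vec<(usize, EmptyBatch)> {
        self.partition_row_counts
            .iter()
            .enumerate()
            .filter(|(_, n)| **n > 0)
            .map(|(p, n)| (p, EmptyBatch { num_rows: *n }))
            .collect()
    }
}
```

   The state here is one `usize` per partition, so memory depends on the
partition count rather than on how many rows or input batches arrive.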


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]