Re: [PR] Use `interleave` in hash repartitioning [datafusion]

via GitHub Sun, 20 Apr 2025 07:03:31 -0700


Dandandan commented on code in PR #15768:
URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051731884



##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -298,25 +299,15 @@ impl BatchPartitioner {
                         .into_iter()
                         .enumerate()
                         .filter_map(|(partition, indices)| {
-                            let indices: PrimitiveArray<UInt32Type> = indices.into();
                             (!indices.is_empty()).then_some((partition, indices))
                         })
                         .map(move |(partition, indices)| {
                             // Tracking time required for repartitioned batches construction
                             let _timer = partitioner_timer.timer();
+                            let b: Vec<&RecordBatch> = batches.iter().collect();
 
                             // Produce batches based on indices
-                            let columns = take_arrays(batch.columns(), &indices, None)?;
-
-                            let mut options = RecordBatchOptions::new();
-                            options = options.with_row_count(Some(indices.len()));
-                            let batch = RecordBatch::try_new_with_options(
-                                batch.schema(),
-                                columns,
-                                &options,
-                            )
-                            .unwrap();
-
+                            let batch = interleave_record_batch(&b, &indices)?;

Review Comment:
   I think this change might make it possible to remove the coalesce step
after every hash repartition, since each output batch will be roughly the
size of an input batch (instead of roughly `batch_size / partitions`). The
exception is skewed hash values, but that case is already handled poorly today.
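
To make the size argument concrete, here is a minimal stdlib-only sketch, not the PR's code. `Batch`, `partition_indices`, and the local `interleave` are hypothetical stand-ins, `key % partitions` replaces the real hash, and it assumes as many input batches are buffered as there are partitions.

```rust
/// Hypothetical stand-in for a record batch: a single column of keys.
type Batch = Vec<u64>;

/// For every output partition, collect the `(batch_index, row_index)` pairs
/// that select its rows from the buffered input batches.
/// `key % partitions` stands in for the real hash function (an assumption).
fn partition_indices(batches: &[Batch], partitions: usize) -> Vec<Vec<(usize, usize)>> {
    let mut indices = vec![Vec::new(); partitions];
    for (batch_idx, batch) in batches.iter().enumerate() {
        for (row_idx, key) in batch.iter().enumerate() {
            let partition = (*key % partitions as u64) as usize;
            indices[partition].push((batch_idx, row_idx));
        }
    }
    indices
}

/// Gather the selected rows from all buffered batches into one output batch.
fn interleave(batches: &[Batch], indices: &[(usize, usize)]) -> Batch {
    indices.iter().map(|&(b, r)| batches[b][r]).collect()
}
```

With 4 partitions and 4 buffered batches of 8 uniformly distributed keys, each partition's index list has 8 entries, so one gather yields a full-size batch. Running the same function on a single batch gives 2 rows per partition, which is the `batch_size / partitions` case that currently needs coalescing.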
   
   A future API could perhaps use your `take_in` API to emit rows only once
the batch size has been reached.
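
One way to picture "emit rows only once the batch size has been reached" is a per-partition buffer like the sketch below. `PartitionBuffer` and its methods are hypothetical names for illustration, the rows are plain `u64` values rather than Arrow arrays, and this is not the `take_in` API.

```rust
/// Hypothetical per-partition buffer; not the `take_in` API mentioned above.
struct PartitionBuffer {
    target_rows: usize,
    rows: Vec<u64>,
}

impl PartitionBuffer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be positive");
        Self { target_rows, rows: Vec::with_capacity(target_rows) }
    }

    /// Append rows and return every batch of exactly `target_rows` rows
    /// that has become available; leftover rows stay buffered.
    fn push(&mut self, incoming: &[u64]) -> Vec<Vec<u64>> {
        self.rows.extend_from_slice(incoming);
        let mut ready = Vec::new();
        while self.rows.len() >= self.target_rows {
            // Keep the tail buffered and hand out the first `target_rows` rows.
            let rest = self.rows.split_off(self.target_rows);
            ready.push(std::mem::replace(&mut self.rows, rest));
        }
        ready
    }

    /// Flush whatever remains at end of input, if anything.
    fn finish(self) -> Option<Vec<u64>> {
        (!self.rows.is_empty()).then_some(self.rows)
    }
}
```

Pushing 3 rows into a buffer with a target of 4 emits nothing. Pushing 6 more emits two full batches and leaves one row for `finish`. Skewed partitions just fill up more slowly instead of producing undersized batches.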



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]