gabotechs commented on code in PR #22010:
URL: https://github.com/apache/datafusion/pull/22010#discussion_r3237020075
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -143,11 +144,183 @@ type MaybeBatch = Option<Result<RepartitionBatch>>;
type InputPartitionsToCurrentPartitionSender =
Vec<DistributionSender<MaybeBatch>>;
type InputPartitionsToCurrentPartitionReceiver =
Vec<DistributionReceiver<MaybeBatch>>;
-/// Output channel with its associated memory reservation and spill writer
+/// One input task's collection of output channels (its send-side view of
+/// every output partition). Owns the per-call helpers for coalescing,
+/// finalizing, and sending.
+///
+/// This complements [`PartitionChannels`] (the per-output-partition,
+/// authoritative struct that owns `rx`, `spill_readers`, and the underlying
+/// `Mutex<LimitedBatchCoalescer>` / `AtomicUsize`). Each [`OutputChannel`]
+/// inside `inner` holds cloned `Arc`s pointing back at those shared
+/// resources.
+struct OutputChannels {
+ inner: HashMap<usize, OutputChannel>,
+ metrics: RepartitionMetrics,
+}
+
+impl OutputChannels {
+ fn new(inner: HashMap<usize, OutputChannel>, metrics: RepartitionMetrics) -> Self {
+ Self { inner, metrics }
+ }
+
+ fn is_empty(&self) -> bool {
+ self.inner.is_empty()
+ }
+
+ fn metrics(&self) -> &RepartitionMetrics {
+ &self.metrics
+ }
+
+ /// Push `batch` for `partition` through its shared coalescer (if any)
+ /// and ship any newly completed batches through the channel.
+ async fn coalesce_and_send(
+ &mut self,
+ partition: usize,
+ batch: RecordBatch,
+ ) -> Result<()> {
+ let Some(channel) = self.inner.get(&partition) else {
+ return Ok(());
+ };
+ let to_send = match &channel.shared_coalescer {
+ Some(shared) => shared.push_and_drain(batch)?,
+ None => vec![batch],
+ };
+ for batch in to_send {
+ self.send_to_channel(partition, batch).await;
+ }
+ Ok(())
+ }
+
+ /// For each output partition this task still has, decrement the shared
+ /// active-senders counter. The last task to do so calls
+ /// [`SharedCoalescer::finalize`] and ships the residual.
+ async fn finalize(&mut self) -> Result<()> {
+ let partitions: Vec<usize> = self.inner.keys().copied().collect();
+ for partition in partitions {
+ let Some(channel) = self.inner.get(&partition) else {
+ continue;
+ };
+ let Some(shared) = channel.shared_coalescer.clone() else {
+ continue;
+ };
+ for batch in shared.finalize()? {
+ self.send_to_channel(partition, batch).await;
+ }
+ }
+ Ok(())
+ }
+
+ /// Send a single batch through the channel for `partition`, applying
+ /// the memory reservation / spill-writer fallback. Removes the channel
+ /// from `self.inner` if the receiver has hung up.
+ async fn send_to_channel(&mut self, partition: usize, batch: RecordBatch) {
+ let size = batch.get_array_memory_size();
+ let timer = self.metrics.send_time[partition].timer();
+
+ // Decide the payload outside of any await: never hold a MutexGuard
+ // across an await point.
+ let (payload, is_memory_batch) = {
+ let Some(channel) = self.inner.get(&partition) else {
+ timer.done();
+ return;
+ };
+ match channel.reservation.try_grow(size) {
Review Comment:
I think this is how it worked before:
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/repartition/mod.rs#L1778-L1814
Before, batches were coalesced later in the `RepartitionExec` pipeline, and
there was no memory accounting happening at that point, so I'd expect the
memory reservation to behave the same as on `main`.
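To make the difference concrete, here is a minimal, self-contained sketch of the send path this diff introduces: try to reserve memory for each batch before sending it in memory, and fall back to the spill writer when the reservation fails. Every type below is a toy stand-in (not DataFusion's `MemoryReservation` or the real spill writer), so read it as an illustration of the pattern rather than the PR's code:

```rust
// Sketch of the reserve-or-spill send path; all types are toy stand-ins.

enum Payload {
    InMemory(Vec<u8>), // stand-in for a RecordBatch kept in memory
    Spilled(usize),    // stand-in for a reference to a spill file
}

struct Reservation {
    used: usize,
    limit: usize,
}

impl Reservation {
    // Mirrors the shape of MemoryReservation::try_grow: succeed only
    // if the pool has room for `size` more bytes.
    fn try_grow(&mut self, size: usize) -> Result<(), String> {
        if self.used + size > self.limit {
            return Err(format!("cannot reserve {size} bytes"));
        }
        self.used += size;
        Ok(())
    }
}

fn send(batch: Vec<u8>, reservation: &mut Reservation, spills: &mut usize) -> Payload {
    let size = batch.len();
    match reservation.try_grow(size) {
        // Room in the pool: the batch travels through the channel in memory.
        Ok(()) => Payload::InMemory(batch),
        // Pool exhausted: fall back to the spill writer instead of erroring.
        Err(_) => {
            *spills += 1;
            Payload::Spilled(*spills)
        }
    }
}

fn main() {
    let mut reservation = Reservation { used: 0, limit: 8 };
    let mut spills = 0;
    // The first batch fits; the second exceeds the limit and is spilled.
    for batch in [vec![0u8; 6], vec![0u8; 6]] {
        match send(batch, &mut reservation, &mut spills) {
            Payload::InMemory(b) => println!("sent {} bytes in memory", b.len()),
            Payload::Spilled(id) => println!("spilled as file #{id}"),
        }
    }
}
```

On `main`, neither branch exists at this point in the pipeline, hence my expectation above.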
Not implying that's the best approach, though. As I invest time in improving
this operator, I find it hard to work with: it seems to have accumulated
several stacked abstractions (structs representing things) over the years,
which makes it hard to maintain.
I wouldn't be surprised if a full rewrite yielded a more maintainable
outcome.
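For whoever does revisit this: one pattern from the new `send_to_channel` worth keeping is the rule in its inline comment about never holding a `MutexGuard` across an await point. A rough sketch of that scoping rule (toy types, plus a hand-rolled single-future executor so it runs on plain `std`; the real code runs under tokio):

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Mutex;
use std::task::{Context, Poll, Waker};

// Toy single-future executor so the sketch needs only `std`.
// `Waker::noop` is stable since Rust 1.85.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut cx = Context::from_waker(Waker::noop());
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Stand-in for `DistributionSender::send(...).await`.
async fn ship(payload: usize) {
    println!("shipped payload of size {payload}");
}

async fn send_scoped(shared: &Mutex<Vec<usize>>) {
    // Decide the payload inside a scoped block so the guard is dropped
    // before the await: a `std::sync::MutexGuard` held across `.await`
    // makes the future non-`Send` and risks deadlocking the executor.
    let payload = {
        let mut state = shared.lock().unwrap();
        state.pop().unwrap_or(0)
    }; // guard dropped here
    ship(payload).await; // safe: no lock held across the await point
}

fn main() {
    let shared = Mutex::new(vec![42]);
    block_on(send_scoped(&shared));
}
```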