alamb commented on code in PR #22010:
URL: https://github.com/apache/datafusion/pull/22010#discussion_r3235171117
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -143,11 +144,183 @@ type MaybeBatch = Option<Result<RepartitionBatch>>;
type InputPartitionsToCurrentPartitionSender =
Vec<DistributionSender<MaybeBatch>>;
type InputPartitionsToCurrentPartitionReceiver =
Vec<DistributionReceiver<MaybeBatch>>;
-/// Output channel with its associated memory reservation and spill writer
+/// One input task's collection of output channels (its send-side view of
+/// every output partition). Owns the per-call helpers for coalescing,
+/// finalizing, and sending.
+///
+/// This complements [`PartitionChannels`] (the per-output-partition,
+/// authoritative struct that owns `rx`, `spill_readers`, and the underlying
+/// `Mutex<LimitedBatchCoalescer>` / `AtomicUsize`). Each [`OutputChannel`]
+/// inside `inner` holds cloned `Arc`s pointing back at those shared
+/// resources.
+struct OutputChannels {
+ inner: HashMap<usize, OutputChannel>,
+ metrics: RepartitionMetrics,
+}
+
+impl OutputChannels {
+ fn new(inner: HashMap<usize, OutputChannel>, metrics: RepartitionMetrics) -> Self {
+ Self { inner, metrics }
+ }
+
+ fn is_empty(&self) -> bool {
+ self.inner.is_empty()
+ }
+
+ fn metrics(&self) -> &RepartitionMetrics {
+ &self.metrics
+ }
+
+ /// Push `batch` for `partition` through its shared coalescer (if any)
+ /// and ship any newly completed batches through the channel.
+ async fn coalesce_and_send(
+ &mut self,
+ partition: usize,
+ batch: RecordBatch,
+ ) -> Result<()> {
+ let Some(channel) = self.inner.get(&partition) else {
+ return Ok(());
+ };
+ let to_send = match &channel.shared_coalescer {
+ Some(shared) => shared.push_and_drain(batch)?,
+ None => vec![batch],
+ };
+ for batch in to_send {
+ self.send_to_channel(partition, batch).await;
+ }
+ Ok(())
+ }
+
+ /// For each output partition this task still has, decrement the shared
+ /// active-senders counter. The last task to do so calls
+ /// [`SharedCoalescer::finalize`] and ships the residual.
+ async fn finalize(&mut self) -> Result<()> {
+ let partitions: Vec<usize> = self.inner.keys().copied().collect();
Review Comment:
I wonder if we could avoid this copy / allocation and just consume `inner`:
```rust
async fn finalize(&mut self) -> Result<()> {
    for (partition, channel) in self.inner.drain() {
        let Some(shared) = channel.shared_coalescer.clone() else {
            continue;
        };
        for batch in shared.finalize()? {
            self.send_to_channel(partition, batch).await;
        }
    }
    Ok(())
}
```
It would probably require rejiggering how `self.send_to_channel` worked to get a passed channel (rather than looking it up again).
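A self-contained toy of that rejiggered shape (the types here are simplified stand-ins, not the real DataFusion definitions): `drain()` yields owned channels, so the send helper can borrow the channel it is handed instead of re-looking it up in `self.inner` (which would conflict with the borrow held by the drain iterator).

```rust
use std::collections::HashMap;

// Simplified stand-in for the PR's OutputChannel.
struct OutputChannel {
    sent: Vec<String>,
}

// Takes the channel directly rather than a partition index, so it no
// longer needs access to the map. In the real code this would await the
// DistributionSender; here we just record the batch.
fn send_to_channel(channel: &mut OutputChannel, batch: String) {
    channel.sent.push(batch);
}

struct OutputChannels {
    inner: HashMap<usize, OutputChannel>,
}

impl OutputChannels {
    /// Consuming finalize: no keys are copied into a temporary Vec, and
    /// the map is left empty when finalization completes.
    fn finalize(&mut self) -> usize {
        let mut finalized = 0;
        for (_partition, mut channel) in self.inner.drain() {
            send_to_channel(&mut channel, "residual".to_string());
            finalized += 1;
        }
        finalized
    }
}

fn main() {
    let mut channels = OutputChannels {
        inner: HashMap::from([(0usize, OutputChannel { sent: vec![] })]),
    };
    let n = channels.finalize();
    println!("finalized {n} channels; map empty: {}", channels.inner.is_empty());
}
```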
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -143,11 +144,183 @@ type MaybeBatch = Option<Result<RepartitionBatch>>;
type InputPartitionsToCurrentPartitionSender =
Vec<DistributionSender<MaybeBatch>>;
type InputPartitionsToCurrentPartitionReceiver =
Vec<DistributionReceiver<MaybeBatch>>;
-/// Output channel with its associated memory reservation and spill writer
+/// One input task's collection of output channels (its send-side view of
+/// every output partition). Owns the per-call helpers for coalescing,
+/// finalizing, and sending.
+///
+/// This complements [`PartitionChannels`] (the per-output-partition,
+/// authoritative struct that owns `rx`, `spill_readers`, and the underlying
+/// `Mutex<LimitedBatchCoalescer>` / `AtomicUsize`). Each [`OutputChannel`]
+/// inside `inner` holds cloned `Arc`s pointing back at those shared
+/// resources.
+struct OutputChannels {
+ inner: HashMap<usize, OutputChannel>,
+ metrics: RepartitionMetrics,
+}
+
+impl OutputChannels {
+ fn new(inner: HashMap<usize, OutputChannel>, metrics: RepartitionMetrics) -> Self {
+ Self { inner, metrics }
+ }
+
+ fn is_empty(&self) -> bool {
+ self.inner.is_empty()
+ }
+
+ fn metrics(&self) -> &RepartitionMetrics {
+ &self.metrics
+ }
+
+ /// Push `batch` for `partition` through its shared coalescer (if any)
+ /// and ship any newly completed batches through the channel.
+ async fn coalesce_and_send(
+ &mut self,
+ partition: usize,
+ batch: RecordBatch,
+ ) -> Result<()> {
+ let Some(channel) = self.inner.get(&partition) else {
+ return Ok(());
+ };
+ let to_send = match &channel.shared_coalescer {
+ Some(shared) => shared.push_and_drain(batch)?,
+ None => vec![batch],
+ };
+ for batch in to_send {
+ self.send_to_channel(partition, batch).await;
+ }
+ Ok(())
+ }
+
+ /// For each output partition this task still has, decrement the shared
+ /// active-senders counter. The last task to do so calls
+ /// [`SharedCoalescer::finalize`] and ships the residual.
+ async fn finalize(&mut self) -> Result<()> {
+ let partitions: Vec<usize> = self.inner.keys().copied().collect();
+ for partition in partitions {
+ let Some(channel) = self.inner.get(&partition) else {
+ continue;
+ };
+ let Some(shared) = channel.shared_coalescer.clone() else {
+ continue;
+ };
+ for batch in shared.finalize()? {
+ self.send_to_channel(partition, batch).await;
+ }
+ }
+ Ok(())
+ }
+
+ /// Send a single batch through the channel for `partition`, applying
+ /// the memory reservation / spill-writer fallback. Removes the channel
+ /// from `self.inner` if the receiver has hung up.
+ async fn send_to_channel(&mut self, partition: usize, batch: RecordBatch) {
+ let size = batch.get_array_memory_size();
+ let timer = self.metrics.send_time[partition].timer();
+
+ // Decide the payload outside of any await: never hold a MutexGuard
+ // across an await point.
+ let (payload, is_memory_batch) = {
+ let Some(channel) = self.inner.get(&partition) else {
+ timer.done();
+ return;
+ };
+ match channel.reservation.try_grow(size) {
Review Comment:
Claude pointed out that this code only increases the memory reservation
*after* coalescing the batch -- in other words, the memory reservation
doesn't account for the memory held in the SharedCoalescer. For really wide
fanouts that could be non-trivial memory (`num_partitions` batches' worth of
memory).
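One hedged way to close that gap (a toy model, not the real `MemoryReservation` / `LimitedBatchCoalescer` APIs): grow the reservation when a batch *enters* the coalescer, and hand the accounting over when a completed batch leaves for the send path.

```rust
// Stand-in for MemoryReservation (the real try_grow is fallible).
struct Reservation {
    bytes: usize,
}

impl Reservation {
    fn grow(&mut self, n: usize) {
        self.bytes += n;
    }
    fn shrink(&mut self, n: usize) {
        self.bytes -= n;
    }
}

// Coalescer that reserves eagerly, so bytes buffered across many output
// partitions are visible to the memory accounting before send time.
struct EagerCoalescer {
    buffered: usize,          // total bytes awaiting coalescing
    reservation: Reservation, // grown at push time, not at send time
    target: usize,
}

impl EagerCoalescer {
    /// Returns the size of a completed batch once enough bytes are buffered.
    fn push(&mut self, size: usize) -> Option<usize> {
        self.reservation.grow(size);
        self.buffered += size;
        if self.buffered >= self.target {
            let emitted = self.buffered;
            self.buffered = 0;
            // Accounting for these bytes moves to the send path, which
            // re-reserves before the channel send (as the PR already does).
            self.reservation.shrink(emitted);
            Some(emitted)
        } else {
            None
        }
    }
}

fn main() {
    let mut c = EagerCoalescer {
        buffered: 0,
        reservation: Reservation { bytes: 0 },
        target: 100,
    };
    assert_eq!(c.push(60), None);
    println!("buffered bytes accounted: {}", c.reservation.bytes);
}
```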
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -143,11 +144,183 @@ type MaybeBatch = Option<Result<RepartitionBatch>>;
type InputPartitionsToCurrentPartitionSender =
Vec<DistributionSender<MaybeBatch>>;
type InputPartitionsToCurrentPartitionReceiver =
Vec<DistributionReceiver<MaybeBatch>>;
-/// Output channel with its associated memory reservation and spill writer
+/// One input task's collection of output channels (its send-side view of
+/// every output partition). Owns the per-call helpers for coalescing,
+/// finalizing, and sending.
+///
+/// This complements [`PartitionChannels`] (the per-output-partition,
+/// authoritative struct that owns `rx`, `spill_readers`, and the underlying
+/// `Mutex<LimitedBatchCoalescer>` / `AtomicUsize`). Each [`OutputChannel`]
+/// inside `inner` holds cloned `Arc`s pointing back at those shared
+/// resources.
+struct OutputChannels {
+ inner: HashMap<usize, OutputChannel>,
+ metrics: RepartitionMetrics,
+}
+
+impl OutputChannels {
+ fn new(inner: HashMap<usize, OutputChannel>, metrics: RepartitionMetrics) -> Self {
+ Self { inner, metrics }
+ }
+
+ fn is_empty(&self) -> bool {
+ self.inner.is_empty()
+ }
+
+ fn metrics(&self) -> &RepartitionMetrics {
+ &self.metrics
+ }
+
+ /// Push `batch` for `partition` through its shared coalescer (if any)
+ /// and ship any newly completed batches through the channel.
+ async fn coalesce_and_send(
+ &mut self,
+ partition: usize,
+ batch: RecordBatch,
+ ) -> Result<()> {
+ let Some(channel) = self.inner.get(&partition) else {
+ return Ok(());
Review Comment:
this just drops the input -- maybe it is worth pointing out in a comment
when this can happen (I think when the output partition has been closed)
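A toy sketch of the early return with such a comment attached (the closed-partition reasoning is my reading of the PR, hedged accordingly):

```rust
use std::collections::HashMap;

// Simplified stand-in for the PR's OutputChannel.
struct Channel;

/// Returns true if the batch was accepted, false if it was dropped.
fn accept(channels: &HashMap<usize, Channel>, partition: usize) -> bool {
    // Entries are removed from the map when the receiver hangs up, i.e.
    // when the output partition has been closed early. A missing entry
    // therefore means nobody will ever read this batch, so dropping it
    // is safe and intentional.
    let Some(_channel) = channels.get(&partition) else {
        return false; // batch silently dropped
    };
    true
}

fn main() {
    let mut channels = HashMap::from([(0usize, Channel)]);
    assert!(accept(&channels, 0));
    channels.remove(&0); // receiver hung up
    println!("after close, accepted: {}", accept(&channels, 0));
}
```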
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -1055,6 +1247,7 @@ impl ExecutionPlan for RepartitionExec {
let name = self.name().to_owned();
let schema = self.schema();
let schema_captured = Arc::clone(&schema);
+ let fetch = self.fetch();
Review Comment:
I think this is always `None` (because `RepartitionExec` doesn't provide an
implementation). I think it would be clearer if you explicitly set fetch to
None here, and maybe left a comment / ticket reference to implement limiting
within the repartition (you'll be in good shape to do this with the
LimitedBatchCoalescer)
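A minimal sketch of that suggestion (the comment wording is mine; no ticket number exists yet, so none is invented):

```rust
fn main() {
    // Make the absence of a limit explicit rather than calling
    // self.fetch(), which (per the review) always returns None because
    // RepartitionExec uses the trait's default implementation.
    // TODO: implement limit pushdown into repartition using the
    // LimitedBatchCoalescer (file a ticket and reference it here).
    let fetch: Option<usize> = None;
    println!("fetch explicitly None: {}", fetch.is_none());
}
```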
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -143,11 +144,183 @@ type MaybeBatch = Option<Result<RepartitionBatch>>;
type InputPartitionsToCurrentPartitionSender =
Vec<DistributionSender<MaybeBatch>>;
type InputPartitionsToCurrentPartitionReceiver =
Vec<DistributionReceiver<MaybeBatch>>;
-/// Output channel with its associated memory reservation and spill writer
+/// One input task's collection of output channels (its send-side view of
+/// every output partition). Owns the per-call helpers for coalescing,
+/// finalizing, and sending.
+///
+/// This complements [`PartitionChannels`] (the per-output-partition,
+/// authoritative struct that owns `rx`, `spill_readers`, and the underlying
+/// `Mutex<LimitedBatchCoalescer>` / `AtomicUsize`). Each [`OutputChannel`]
+/// inside `inner` holds cloned `Arc`s pointing back at those shared
+/// resources.
+struct OutputChannels {
+ inner: HashMap<usize, OutputChannel>,
Review Comment:
I agree using Vec is better (faster, clearer what is allowed). Maybe as a
follow-on PR.
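A sketch of what that follow-on could look like (`OutputChannel` is a stand-in for the PR's struct): index channels by output partition in a `Vec`, with `None` meaning "receiver hung up / removed", so lookup is a bounds-checked index instead of a hash.

```rust
// Simplified stand-in for the PR's OutputChannel.
struct OutputChannel;

struct OutputChannels {
    inner: Vec<Option<OutputChannel>>,
}

impl OutputChannels {
    fn new(num_partitions: usize) -> Self {
        Self {
            inner: (0..num_partitions).map(|_| Some(OutputChannel)).collect(),
        }
    }

    /// O(1) indexed lookup instead of hashing the partition id.
    fn get(&self, partition: usize) -> Option<&OutputChannel> {
        self.inner.get(partition).and_then(|slot| slot.as_ref())
    }

    /// "Remove" tombstones the slot; other partitions keep stable indices.
    fn remove(&mut self, partition: usize) {
        if let Some(slot) = self.inner.get_mut(partition) {
            *slot = None;
        }
    }

    fn is_empty(&self) -> bool {
        self.inner.iter().all(|slot| slot.is_none())
    }
}

fn main() {
    let mut channels = OutputChannels::new(2);
    channels.remove(0);
    println!(
        "partition 0 live: {}, partition 1 live: {}",
        channels.get(0).is_some(),
        channels.get(1).is_some()
    );
}
```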
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -143,11 +144,183 @@ type MaybeBatch = Option<Result<RepartitionBatch>>;
type InputPartitionsToCurrentPartitionSender =
Vec<DistributionSender<MaybeBatch>>;
type InputPartitionsToCurrentPartitionReceiver =
Vec<DistributionReceiver<MaybeBatch>>;
-/// Output channel with its associated memory reservation and spill writer
+/// One input task's collection of output channels (its send-side view of
+/// every output partition). Owns the per-call helpers for coalescing,
+/// finalizing, and sending.
+///
+/// This complements [`PartitionChannels`] (the per-output-partition,
+/// authoritative struct that owns `rx`, `spill_readers`, and the underlying
+/// `Mutex<LimitedBatchCoalescer>` / `AtomicUsize`). Each [`OutputChannel`]
+/// inside `inner` holds cloned `Arc`s pointing back at those shared
+/// resources.
+struct OutputChannels {
+ inner: HashMap<usize, OutputChannel>,
+ metrics: RepartitionMetrics,
+}
+
+impl OutputChannels {
+ fn new(inner: HashMap<usize, OutputChannel>, metrics: RepartitionMetrics) -> Self {
+ Self { inner, metrics }
+ }
+
+ fn is_empty(&self) -> bool {
+ self.inner.is_empty()
+ }
+
+ fn metrics(&self) -> &RepartitionMetrics {
+ &self.metrics
+ }
+
+ /// Push `batch` for `partition` through its shared coalescer (if any)
+ /// and ship any newly completed batches through the channel.
+ async fn coalesce_and_send(
+ &mut self,
+ partition: usize,
+ batch: RecordBatch,
+ ) -> Result<()> {
+ let Some(channel) = self.inner.get(&partition) else {
+ return Ok(());
+ };
+ let to_send = match &channel.shared_coalescer {
+ Some(shared) => shared.push_and_drain(batch)?,
+ None => vec![batch],
+ };
+ for batch in to_send {
+ self.send_to_channel(partition, batch).await;
+ }
+ Ok(())
+ }
+
+ /// For each output partition this task still has, decrement the shared
+ /// active-senders counter. The last task to do so calls
+ /// [`SharedCoalescer::finalize`] and ships the residual.
+ async fn finalize(&mut self) -> Result<()> {
+ let partitions: Vec<usize> = self.inner.keys().copied().collect();
+ for partition in partitions {
+ let Some(channel) = self.inner.get(&partition) else {
+ continue;
+ };
+ let Some(shared) = channel.shared_coalescer.clone() else {
+ continue;
+ };
+ for batch in shared.finalize()? {
+ self.send_to_channel(partition, batch).await;
+ }
+ }
+ Ok(())
+ }
+
+ /// Send a single batch through the channel for `partition`, applying
+ /// the memory reservation / spill-writer fallback. Removes the channel
+ /// from `self.inner` if the receiver has hung up.
+ async fn send_to_channel(&mut self, partition: usize, batch: RecordBatch) {
+ let size = batch.get_array_memory_size();
+ let timer = self.metrics.send_time[partition].timer();
+
+ // Decide the payload outside of any await: never hold a MutexGuard
+ // across an await point.
+ let (payload, is_memory_batch) = {
+ let Some(channel) = self.inner.get(&partition) else {
+ timer.done();
+ return;
+ };
+ match channel.reservation.try_grow(size) {
+ Ok(_) => (Ok(RepartitionBatch::Memory(batch)), true),
+ Err(_) => match channel.spill_writer.push_batch(&batch) {
+ Ok(()) => (Ok(RepartitionBatch::Spilled), false),
+ Err(err) => (Err(err), false),
+ },
+ }
+ };
+
+ let Some(channel) = self.inner.get(&partition) else {
+ timer.done();
+ return;
+ };
+ let send_err = channel.sender.send(Some(payload)).await.is_err();
+ if send_err {
+ if is_memory_batch && let Some(channel) = self.inner.get(&partition) {
+ channel.reservation.shrink(size);
+ }
+ self.inner.remove(&partition);
+ }
+ timer.done();
+ }
+}
+
+/// A producer-side coalescer shared across all input tasks targeting a
+/// single output partition.
+///
+/// Bundles the [`LimitedBatchCoalescer`] (behind a [`Mutex`]) with the
+/// active-sender counter that tracks how many input tasks may still push
+/// into it. The last task to call [`Self::finalize`] is the one that
+/// finalizes the coalescer and ships the residual batch.
+///
+/// Cheap to [`Clone`]: both fields are [`Arc`]s.
+#[derive(Clone)]
+struct SharedCoalescer {
+ inner: Arc<Mutex<LimitedBatchCoalescer>>,
+ active_senders: Arc<AtomicUsize>,
+}
+
+impl SharedCoalescer {
+ fn new(
+ schema: SchemaRef,
+ target_batch_size: usize,
+ fetch: Option<usize>,
+ num_senders: usize,
+ ) -> Self {
+ Self {
+ inner: Arc::new(Mutex::new(LimitedBatchCoalescer::new(
+ schema,
+ target_batch_size,
+ fetch,
+ ))),
+ active_senders: Arc::new(AtomicUsize::new(num_senders)),
+ }
+ }
+
+ /// Push `batch` into the coalescer and drain any newly completed
+ /// batches. The mutex is held only briefly.
+ fn push_and_drain(&self, batch: RecordBatch) -> Result<Vec<RecordBatch>> {
+ let mut acc = Vec::new();
+ let mut c = self.inner.lock();
+ c.push_batch(batch)?;
Review Comment:
Claude also pointed out that the returned status is dropped here -- so if
the limit is reached, nothing will happen.
That being said, it seems like RepartitionExec doesn't actually have a
`fetch` field (aka the operator itself doesn't implement a limit), so this
isn't a problem:
https://github.com/apache/datafusion/blob/dcf648255b92a34798871139aeba12d95f8f3032/datafusion/physical-plan/src/repartition/mod.rs#L1032-L1045
However, the fact that the code in here passes through a `fetch` is pretty
confusing, as it implies the fetch can be something other than `None` -- see
comment below.
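A hypothetical sketch of not dropping the status -- `PushStatus` and this `Coalescer` are illustrative stand-ins, not the real `LimitedBatchCoalescer` API. The point is that if RepartitionExec ever grows a real `fetch`, the push path must inspect the status and stop sending once the limit is hit.

```rust
#[derive(Debug, PartialEq)]
enum PushStatus {
    Continue,
    LimitReached,
}

// Toy limited coalescer that counts rows against an optional fetch.
struct Coalescer {
    rows: usize,
    fetch: Option<usize>,
}

impl Coalescer {
    fn push_batch(&mut self, num_rows: usize) -> PushStatus {
        self.rows += num_rows;
        match self.fetch {
            // Once the limit is reached, the caller should finalize and
            // stop pushing further input instead of ignoring the status.
            Some(limit) if self.rows >= limit => PushStatus::LimitReached,
            _ => PushStatus::Continue,
        }
    }
}

fn main() {
    let mut c = Coalescer { rows: 0, fetch: Some(10) };
    println!("{:?}", c.push_batch(6));
    println!("{:?}", c.push_batch(5));
}
```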
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]