adriangb commented on code in PR #18014:
URL: https://github.com/apache/datafusion/pull/18014#discussion_r2422817514
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -1007,12 +1071,37 @@ impl RepartitionExec {
         let timer = metrics.send_time[partition].timer();
         // if there is still a receiver, send to it
-        if let Some((tx, reservation)) = output_channels.get_mut(&partition) {
-            reservation.lock().try_grow(size)?;
-
-            if tx.send(Some(Ok(batch))).await.is_err() {
+        if let Some((tx, reservation, spill_manager)) =
+            output_channels.get_mut(&partition)
+        {
+            let (batch_to_send, is_memory_batch) =
+                match reservation.lock().try_grow(size) {
+                    Ok(_) => {
+                        // Memory available - send in-memory batch
+                        (RepartitionBatch::Memory(batch), true)
+                    }
+                    Err(_) => {
+                        // We're memory limited - spill this single batch to its own file
Review Comment:
So you always append to the end; there is no way, AFAIK, to "insert" at the
beginning of a file. But you want to read the batches in order, so, as you say,
you'd have to store offsets and read from the beginning of the file. But then,
once you've read that batch from the beginning, do you just leave the data
there, or do you rewrite the entire file?
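For concreteness, here's a minimal sketch of that offsets approach (a hypothetical `SpillFile` over plain byte buffers rather than Arrow IPC, so the names here are illustrative, not the PR's API): appends always go to the end, an in-memory `(offset, len)` index preserves read order, and a consumed batch is simply left in place on disk until the whole file is deleted.

```rust
use std::fs::{File, OpenOptions};
use std::io::{Read, Seek, SeekFrom, Write};

/// Hypothetical append-only spill file: batches are always appended at the
/// end, and an in-memory index of (offset, len) lets us read them back in
/// the order they were written.
struct SpillFile {
    file: File,
    offsets: Vec<(u64, usize)>,
}

impl SpillFile {
    fn create(path: &str) -> std::io::Result<Self> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(path)?;
        Ok(Self { file, offsets: Vec::new() })
    }

    /// Append one serialized batch and record where it landed.
    fn append(&mut self, bytes: &[u8]) -> std::io::Result<()> {
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(bytes)?;
        self.offsets.push((offset, bytes.len()));
        Ok(())
    }

    /// Read batch `i` by seeking to its recorded offset. The consumed bytes
    /// stay on disk; nothing is compacted until the whole file is deleted.
    fn read(&mut self, i: usize) -> std::io::Result<Vec<u8>> {
        let (offset, len) = self.offsets[i];
        self.file.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; len];
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
}

fn main() -> std::io::Result<()> {
    let mut spill = SpillFile::create("/tmp/spill_demo.bin")?;
    spill.append(b"batch-0")?;
    spill.append(b"batch-1")?;
    // Reads come back in write order, leaving the file untouched.
    assert_eq!(spill.read(0)?, b"batch-0");
    assert_eq!(spill.read(1)?, b"batch-1");
    std::fs::remove_file("/tmp/spill_demo.bin")
}
```

In this sketch the answer to "leave the data there?" is yes: disk space is only reclaimed when the whole file is removed, which is the usual trade-off for avoiding rewrites.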
Also keep in mind that there are concurrency issues: since upstream partitions
execute in parallel, we may end up trying to write to the same spill file
concurrently if we use one spill file per upstream partition.
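To illustrate that hazard, here's a sketch assuming plain threads and a `Mutex`-guarded writer (the real operators are async tasks, but the interleaving problem is the same): without serializing appends, two writers can interleave bytes mid-batch and corrupt the framing.

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() -> std::io::Result<()> {
    // Hypothetical shared spill file. If parallel writers share one file,
    // every append must be serialized (a lock here; a single dedicated
    // writer task would also work), or interleaved writes corrupt framing.
    let file = Arc::new(Mutex::new(std::fs::File::create("/tmp/shared_spill.bin")?));

    let handles: Vec<_> = (0..4)
        .map(|part| {
            let file = Arc::clone(&file);
            thread::spawn(move || {
                // Hold the lock for the whole append so the write is atomic
                // with respect to the other simulated partitions.
                let mut guard = file.lock().unwrap();
                writeln!(guard, "batch from partition {part}").unwrap();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    std::fs::remove_file("/tmp/shared_spill.bin")
}
```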
I understand why you feel that opening, writing, closing, reopening, reading,
and deleting is excessive, but unless it causes issues or is measurably worse
in benchmarks than an alternative, I'm inclined to accept it, at least for now.
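For reference, that per-batch lifecycle as a sketch with hypothetical helper names (not the PR's actual `SpillManager` API): because each spilled batch gets its own file, every file has exactly one writer and one reader, so no cross-task coordination is needed, at the cost of one small file per spilled batch.

```rust
use std::fs;
use std::io::{Read, Write};

/// Hypothetical per-batch spill: create + write + close (file closes on drop).
fn spill_one_batch(path: &str, bytes: &[u8]) -> std::io::Result<()> {
    let mut f = fs::File::create(path)?;
    f.write_all(bytes)
}

/// Later, on the consumer side: open + read + close, then delete.
fn read_and_delete(path: &str) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    fs::File::open(path)?.read_to_end(&mut buf)?;
    fs::remove_file(path)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // One file per spilled batch: one writer, one reader, no coordination.
    spill_one_batch("/tmp/batch_42.spill", b"serialized batch")?;
    assert_eq!(read_and_delete("/tmp/batch_42.spill")?, b"serialized batch");
    Ok(())
}
```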