adriangb commented on code in PR #18014:
URL: https://github.com/apache/datafusion/pull/18014#discussion_r2422792326
##########
datafusion/physical-plan/src/repartition/mod.rs:
##########
@@ -1007,12 +1071,37 @@ impl RepartitionExec {
let timer = metrics.send_time[partition].timer();
// if there is still a receiver, send to it
-            if let Some((tx, reservation)) = output_channels.get_mut(&partition) {
-                reservation.lock().try_grow(size)?;
-
-                if tx.send(Some(Ok(batch))).await.is_err() {
+            if let Some((tx, reservation, spill_manager)) =
+                output_channels.get_mut(&partition)
+            {
+                let (batch_to_send, is_memory_batch) =
+                    match reservation.lock().try_grow(size) {
+                        Ok(_) => {
+                            // Memory available - send in-memory batch
+                            (RepartitionBatch::Memory(batch), true)
+                        }
+                        Err(_) => {
+                            // We're memory limited - spill this single batch to its own file
Review Comment:
> yes, having a file per batch can get the process killed (by the OS) due to too many open files
I checked, and indeed we close the files after writing, so this is not going to happen. The only thing we accumulate is `PathBuf`s, not open file descriptors.
> closing and then opening files will bring at least 6 system calls per batch
> Ideally, if we could keep the file open and then share offsets after each write, it would save a lot of syscalls
I think that would be nice, but I can't think of a scheme that would make it work: reading from the end of the file isn't compatible with our use case because we *need* FIFO ordering, and a single file only really works if you do LIFO. Do you have any ideas on how we could do this in a relatively simple way?
Given that any query that would spill here would previously have errored out, I'm not too worried about performance. And frankly, if batch sizes are reasonable, a couple of extra syscalls won't be measurable.
That said, since how the spilling happens is all very internal / private, there's no reason we can't merge this and then come back and improve it later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]