yjshen commented on a change in pull request #1921:
URL: https://github.com/apache/arrow-datafusion/pull/1921#discussion_r819304060
##########
File path: datafusion/src/execution/memory_manager.rs
##########
@@ -340,7 +341,13 @@ impl MemoryManager {
         } else if current < min_per_rqt {
             // if we cannot acquire at lease 1/2n memory, just wait for others
             // to spill instead spill self frequently with limited total mem
-            self.cv.wait(&mut rqt_current_used);
+            let timeout = self
+                .cv
+                .wait_for(&mut rqt_current_used, Duration::from_secs(5));
Review comment:
Is this wait blocking your query indefinitely? If so, could the real cause be a
leak somewhere else that we should fix instead?
The primary reason for waiting indefinitely on a notify signal here is to avoid
repeated self-spilling by a consumer with only a small memory share, which would
produce many spill files and degrade performance.
If a consumer cannot get at least 1/2n of the total memory, blocking its thread
and yielding computational resources to others is arguably better: the larger
consumers can then progress faster, until they either trigger self-spilling or
finish their jobs.
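As an illustration only (this is not the exact DataFusion implementation; `should_block` and its arguments are made up for this sketch), the 1/2n rule amounts to something like:

```rust
// Hypothetical sketch of the "1/2n" minimum-share rule discussed above: with a
// shared pool and n tracked consumers, a consumer that cannot secure at least
// pool_size / (2 * n) bytes blocks and waits for bigger consumers to spill or
// finish, instead of repeatedly spilling its own small partial results.
fn should_block(current_used: usize, pool_size: usize, num_consumers: usize) -> bool {
    let min_per_requester = pool_size / (2 * num_consumers);
    current_used < min_per_requester
}

fn main() {
    // e.g. a 1 GiB pool shared by 8 consumers gives a 64 MiB minimum share
    assert!(should_block(32 << 20, 1 << 30, 8)); // 32 MiB < 64 MiB: wait
    assert!(!should_block(128 << 20, 1 << 30, 8)); // 128 MiB >= 64 MiB: proceed
}
```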
Working with only a minimal amount of memory is not ideal because it harms
overall query throughput: even if the repeatedly spilling consumer gets enough
memory later, it still has to spend considerable effort reading its spills and
merging the partial results.
We currently assume a batch engine. Maybe one day we can have a customized task
scheduler that self-tunes based on CPU and memory usage, perhaps even with
adaptive partition size tuning.
Nevertheless, if we must add a timeout, please also print a warning log, since
hitting it could be a symptom of a bug. And yes, I think making the wait
configurable is necessary; please also choose a longer default timeout if
possible.
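For concreteness, here is a minimal sketch of what that could look like, assuming a parking_lot `Condvar` as in the current code; the `wait_timeout` field, the 60-second default, and `wait_for_spill` are hypothetical and not existing DataFusion APIs:

```rust
use std::time::Duration;

use log::warn;
use parking_lot::{Condvar, Mutex, MutexGuard};

// Hypothetical sketch only: `wait_timeout` stands in for a configurable option.
struct MemoryManagerSketch {
    cv: Condvar,
    wait_timeout: Duration,
}

impl MemoryManagerSketch {
    fn wait_for_spill(&self, rqt_current_used: &mut MutexGuard<'_, usize>) {
        // Wait for another consumer to spill and notify us, but give up after
        // the configured timeout and leave a trace in the logs, since hitting
        // the timeout may be a symptom of a leaked reservation elsewhere.
        let result = self.cv.wait_for(rqt_current_used, self.wait_timeout);
        if result.timed_out() {
            warn!(
                "memory manager waited {:?} without being notified; \
                 this may indicate a memory accounting leak",
                self.wait_timeout
            );
        }
    }
}

fn main() {
    let state = Mutex::new(0usize);
    let mgr = MemoryManagerSketch {
        cv: Condvar::new(),
        // a longer, configurable default than the 5 seconds in the diff
        wait_timeout: Duration::from_secs(60),
    };
    let mut guard = state.lock();
    // In the real manager another thread would spill and call `cv.notify_all()`;
    // here nothing notifies, so the wait simply times out and logs a warning.
    mgr.wait_for_spill(&mut guard);
}
```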