alamb opened a new issue #587:
URL: https://github.com/apache/arrow-datafusion/issues/587


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   If DataFusion processes individual partitions that are larger than the 
available memory system memory, right now it will keep allocating memory from 
the system until it is killed by the OS or container system. 
   
   Also, when running multiple datafusion plans in the same process, each will 
consume memory without limit where it may be desirable to reserve / cap memory 
usage by any individual plan to ensure the plans don't together exceed the 
system memory budget
   
   Thus, it would be nice if we could give DataFusion's plans a memory budget  
which they then stayed under
   
   
   **Describe the solution you'd like**
   1. Add an option to ExecutionConfig that has a “total plan memory budget”
   2. Add logic to each node that requires a memory buffer to ensure it stays 
under the limit.
   
   The operators that can use large amounts of memory today are:
   1. Sort
   2. Join
   3. GroupByHash
   
   There are many potential ways to ensure the limit is respected:
   1. (Simplest) error if the budget is exceeded
   2. (more complex): employ algorithms that can use secondary storage (e.g. 
temp files) like sort that spills multiple round of partial sorted results and 
give a final merge phase for the partition global ordering
   
   **Describe alternatives you've considered**
   There are some interesting tradeoffs between “up front allocation” dividing 
memory up across all operators that would need it and a more dynamic approach.
   
   This is likely something that will require some major efforts over many 
different issues -- I suggest we use this issue to implement a simple "error if 
over limit" strategy and then work on more sophisticated strategies subsequently
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to