Lei Sun created ORC-846:
---------------------------

             Summary: Refactoring Memory-Manager for better extensibility 
                 Key: ORC-846
                 URL: https://issues.apache.org/jira/browse/ORC-846
             Project: ORC
          Issue Type: Improvement
            Reporter: Lei Sun
            Assignee: Lei Sun


Hi ORC Community, 

In our use case, dynamic partitioning is quite common, and our engine 
([Gobblin|https://github.com/apache/gobblin]) doesn't have a shuffle stage, so 
a single thread handling multiple writers (one per partition) is a common 
pattern. When the number of partitions becomes large, it is hard to avoid 
OOM with the current memory-manager implementation: 
 * `rows.between.memory.check` is a writer-level setting (it gates the 
expensive check of the treeWriter's estimated memory against each writer's 
memoryLimit), and the memory manager is not aware of it. The memory manager 
only controls the "scale" of allocation, which hints each writer to lower its 
memory-limit threshold (initially the stripe size). 
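
To make the current behavior concrete, here is a simplified model of the two 
pieces described above. It is a sketch for discussion, not the actual ORC 
code: the class name `AllocationScaleModel`, the method names, and all 
parameter names are hypothetical. 

```java
/** Hypothetical, simplified model of the current per-writer memory check. */
final class AllocationScaleModel {

  /**
   * Manager side: when the writers together ask for more than the pool,
   * every writer is scaled down by the same factor; otherwise the scale is 1.
   */
  static double allocationScale(long totalMemoryPool, long totalAllocation) {
    return totalAllocation > totalMemoryPool
        ? (double) totalMemoryPool / totalAllocation
        : 1.0;
  }

  /**
   * Writer side: the expensive memory estimate is only compared against the
   * scaled limit once enough rows have been added since the last check.
   * The manager never sees rowsSinceCheck, which is the gap described above.
   */
  static boolean shouldFlush(long rowsSinceCheck, long rowsBetweenChecks,
                             long estimatedBytes, long stripeSize, double scale) {
    if (rowsSinceCheck < rowsBetweenChecks) {
      return false; // gate: skip the check entirely
    }
    return estimatedBytes > Math.round(stripeSize * scale);
  }
}
```

With many writers, each one can sit just below its own row gate while the 
combined buffered data is already large, because no component sees the sum. 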

 

What is proposed here is to: 
 * Have centralized awareness, in the memoryManager, of how many rows have been 
buffered across all writers. The data structure used will be thread-safe so 
that the issue fixed in https://issues.apache.org/jira/browse/ORC-361# won't 
resurface. No additional synchronization is introduced beyond the intrinsic 
control of the concurrent data structure that tracks how many rows are 
buffered in each writer. The existing memory manager will keep interfacing with 
each writer in a backward-compatible way. 
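
A minimal sketch of what such centralized tracking could look like, assuming a 
`ConcurrentHashMap` of per-writer counters. The class `BufferedRowTracker`, 
its methods, and the use of a string writer id are all hypothetical and only 
illustrate the idea. It also assumes a writer is not removed while another 
thread is still adding rows to it. 

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: rows buffered per writer and across all writers. */
final class BufferedRowTracker {
  // Key: a writer identifier (for example the output path).
  // Value: rows buffered in that writer since its last flush.
  private final ConcurrentHashMap<String, AtomicLong> rowsPerWriter =
      new ConcurrentHashMap<>();
  private final AtomicLong totalRows = new AtomicLong();

  /** Called by a writer after it buffers a batch of rows. */
  void addRows(String writerId, long rows) {
    rowsPerWriter.computeIfAbsent(writerId, k -> new AtomicLong()).addAndGet(rows);
    totalRows.addAndGet(rows);
  }

  /** Called after a writer flushes; returns how many rows it had buffered. */
  long flushed(String writerId) {
    AtomicLong counter = rowsPerWriter.get(writerId);
    if (counter == null) {
      return 0L;
    }
    long rows = counter.getAndSet(0L);
    totalRows.addAndGet(-rows);
    return rows;
  }

  /** Called when a writer is closed. */
  void removeWriter(String writerId) {
    AtomicLong counter = rowsPerWriter.remove(writerId);
    if (counter != null) {
      totalRows.addAndGet(-counter.getAndSet(0L));
    }
  }

  long rowsBuffered(String writerId) {
    AtomicLong counter = rowsPerWriter.get(writerId);
    return counter == null ? 0L : counter.get();
  }

  long totalRowsBuffered() {
    return totalRows.get();
  }
}
```

No explicit lock is taken here; the only coordination comes from the map and 
the atomic counters. 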

 

With this, the existing memory manager can be extended to control flushing at 
per-writer granularity and to treat writers with different priorities. For 
example, writers handling less-favored partitions could be flushed more often 
to reduce overall memory pressure. All such "prioritization" should be 
localized to the engines that use the ORC writer. 
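
As an illustration of an engine-side policy, the sketch below picks which 
writers to flush once the total buffered rows exceed a budget, starting with 
the lowest-priority writers. Everything in it (`PriorityFlushPolicy`, 
`WriterState`, the integer priority, the row budget) is a hypothetical 
example of such a policy, not part of ORC or of this proposal's API. 

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical engine-side policy: flush low-priority writers first. */
final class PriorityFlushPolicy {

  /** Snapshot of one writer; a lower priority value means "less favored". */
  static final class WriterState {
    final String id;
    final int priority;
    final long bufferedRows;

    WriterState(String id, int priority, long bufferedRows) {
      this.id = id;
      this.priority = priority;
      this.bufferedRows = bufferedRows;
    }
  }

  /**
   * Returns the ids of writers to flush, in order, so that the rows left
   * buffered afterwards are at or below rowBudget. Writers are considered
   * by ascending priority, and by descending buffered rows within a priority.
   */
  static List<String> selectWritersToFlush(List<WriterState> writers, long rowBudget) {
    long total = 0L;
    for (WriterState w : writers) {
      total += w.bufferedRows;
    }
    List<WriterState> sorted = new ArrayList<>(writers);
    sorted.sort(Comparator.comparingInt((WriterState w) -> w.priority)
        .thenComparing(
            Comparator.comparingLong((WriterState w) -> w.bufferedRows).reversed()));

    List<String> toFlush = new ArrayList<>();
    for (WriterState w : sorted) {
      if (total <= rowBudget) {
        break; // already under budget
      }
      if (w.bufferedRows == 0L) {
        continue; // nothing to gain from flushing an empty writer
      }
      toFlush.add(w.id);
      total -= w.bufferedRows;
    }
    return toFlush;
  }
}
```

The memory manager would only need to expose the per-writer row counts; the 
ordering rule itself stays in the engine. 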

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
