Siddharth Teotia created ARROW-2019:
---------------------------------------

             Summary: Control the memory allocated for inner vector in LIST
                 Key: ARROW-2019
                 URL: https://issues.apache.org/jira/browse/ARROW-2019
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Siddharth Teotia
            Assignee: Siddharth Teotia


We have observed cases in our external sort code where the amount of memory 
actually allocated for a record batch turns out to be more than necessary, 
and also more than what the operator had reserved for special purposes. 
The affected queries then fail with OOM.

The usual way to control the memory allocated by vector.allocateNew() is to 
call setInitialCapacity() first; it modifies the vector's state variables, 
which are then used to allocate memory. However, due to the multiplier of 5 
used in ListVector, we end up asking for more memory than necessary. For 
example, for a value count of 4095, we asked for 128KB of memory for the offset 
buffer of the VarCharVector for a field that was a list of varchars. 

The computation was ((4095 * 5) + 1) * 4 = 81904 bytes (about 80KB), which 
was then rounded up to the next power of 2, giving a 128KB allocation. 
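
The arithmetic above can be reproduced with a small standalone sketch. It does 
not call any Arrow code: the class and method names are hypothetical, and the 
4-byte offset width and power-of-2 rounding are taken from the description 
above rather than from the allocator source.

```java
// Hypothetical, standalone sketch; it does not use Arrow classes.
public final class OffsetBufferSizing {
    // Assumption: each offset entry is 4 bytes wide, as in the computation above.
    static final int OFFSET_WIDTH = 4;

    // Bytes requested for an offset buffer: one offset per inner value, plus one.
    static long offsetBufferBytes(int valueCount, int multiplier) {
        long innerValueCount = (long) valueCount * multiplier;
        return (innerValueCount + 1) * OFFSET_WIDTH;
    }

    // Round a positive size up to the next power of 2.
    static long roundUpToPowerOfTwo(long bytes) {
        long highest = Long.highestOneBit(bytes);
        return highest == bytes ? bytes : highest << 1;
    }

    public static void main(String[] args) {
        long requested = offsetBufferBytes(4095, 5);     // 81904 bytes, about 80KB
        long allocated = roundUpToPowerOfTwo(requested); // 131072 bytes, 128KB
        System.out.println(requested + " bytes requested, " + allocated + " bytes allocated");
    }
}
```

With a multiplier of 1 the same value count would need (4095 + 1) * 4 = 16384 
bytes, which is already a power of 2.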

We had earlier made changes to setInitialCapacity() of ListVector when we were 
facing problems with deeply nested lists and decided to use the multiplier only 
for the leaf scalar vector. 

It looks like there is a need for a specialized setInitialCapacity() for 
ListVector where the caller dictates the repeatedness.
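
As a rough illustration of that idea, the sketch below shows how a 
caller-supplied repeatedness could replace the fixed multiplier when sizing the 
inner vector. All names here are hypothetical and this is not the ListVector 
API; it only shows the intended calculation.

```java
// Hypothetical sketch of the proposal; not the actual ListVector implementation.
public final class ListCapacitySketch {
    // The fixed multiplier mentioned in this issue.
    static final int DEFAULT_MULTIPLIER = 5;

    // Current behaviour as described: inner capacity is always valueCount * 5.
    static int innerCapacity(int valueCount) {
        return innerCapacity(valueCount, DEFAULT_MULTIPLIER);
    }

    // Proposed behaviour: the caller states the expected number of inner
    // elements per list entry (the "repeatedness").
    static int innerCapacity(int valueCount, double repeatedness) {
        if (valueCount < 0 || repeatedness < 0) {
            throw new IllegalArgumentException("valueCount and repeatedness must be non-negative");
        }
        long capacity = (long) Math.ceil(valueCount * repeatedness);
        return Math.toIntExact(capacity); // throws ArithmeticException on int overflow
    }

    public static void main(String[] args) {
        System.out.println(innerCapacity(4095));      // fixed multiplier of 5
        System.out.println(innerCapacity(4095, 1.0)); // caller expects one element per list
    }
}
```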

There is also a separate bug in setInitialCapacity(): the allocation of the 
validity buffer does not honor the capacity specified in setInitialCapacity(). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
