Siddharth Teotia updated ARROW-2019:
    Fix Version/s: 0.9.0

> Control the memory allocated for inner vector in LIST
> -----------------------------------------------------
>                 Key: ARROW-2019
>                 URL: https://issues.apache.org/jira/browse/ARROW-2019
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Siddharth Teotia
>            Assignee: Siddharth Teotia
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.9.0
> We have observed cases in our external sort code where the amount of memory 
> actually allocated for a record batch sometimes turns out to be more than 
> necessary and also more than what was reserved by the operator for special 
> purposes. Thus queries fail with OOM.
> Usually to control the memory allocated by vector.allocateNew() is to do a 
> setInitialCapacity() and the latter modifies the vector state variables which 
> are then used to allocate memory. However, due to the multiplier of 5 used in 
> List Vector, we end up asking for more memory than necessary. For example, 
> for a value count of 4095, we asked for 128KB of memory for an offset buffer 
> of VarCharVector for a field which was list of varchars. 
> We did ((4095 * 5) + 1) * 4 => 80KB . => 128KB (rounded off to power of 2 
> allocation). 
> We had earlier made changes to setInitialCapacity() of ListVector when we 
> were facing problems with deeply nested lists and decided to use the 
> multiplier only for the leaf scalar vector. 
> It looks like there is a need for a specialized setInitialCapacity() for 
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity() where the allocation of 
> validity buffer doesn't obey the capacity specified in setInitialCapacity(). 

This message was sent by Atlassian JIRA

Reply via email to