[
https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-2019:
----------------------------------
Labels: pull-request-available (was: )
> Control the memory allocated for inner vector in LIST
> -----------------------------------------------------
>
> Key: ARROW-2019
> URL: https://issues.apache.org/jira/browse/ARROW-2019
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Siddharth Teotia
> Assignee: Siddharth Teotia
> Priority: Critical
> Labels: pull-request-available
>
> We have observed cases in our external sort code where the amount of memory
> actually allocated for a record batch sometimes turns out to be more than
> necessary and also more than what was reserved by the operator for special
> purposes. Thus queries fail with OOM.
> The usual way to control the memory allocated by vector.allocateNew() is to
> call setInitialCapacity() first; the latter modifies the vector state
> variables, which are then used to allocate memory. However, due to the
> multiplier of 5 used in ListVector, we end up asking for more memory than
> necessary. For example, for a value count of 4095, we asked for 128KB of
> memory for the offset buffer of a VarCharVector, for a field that was a list
> of varchars.
> We computed ((4095 * 5) + 1) * 4 = 81904 bytes (~80KB), which is rounded up
> to 128KB (power-of-2 allocation).
> We had earlier made changes to setInitialCapacity() of ListVector when we
> were facing problems with deeply nested lists and decided to use the
> multiplier only for the leaf scalar vector.
> It looks like there is a need for a specialized setInitialCapacity() for
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity(): the allocation of the
> validity buffer doesn't obey the specified capacity.
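
The sizing arithmetic in the description can be reproduced with a minimal standalone sketch. It does not call any Arrow API: the class OffsetBufferSizing and its methods requestedBytes() and roundUpToPowerOfTwo() are hypothetical names used only for illustration. It assumes 4-byte offsets and a power-of-2 allocator, as the description states, and models the "repeatedness" as a caller-supplied factor in place of the fixed multiplier of 5.

```java
// Illustrative model only; these names are hypothetical and not part of Arrow.
class OffsetBufferSizing {
    // Width in bytes of one offset entry (the "* 4" in the description).
    static final int OFFSET_WIDTH = 4;

    // Bytes requested for the inner offset buffer:
    // (valueCount * repeat) inner values need one extra trailing offset.
    static long requestedBytes(int valueCount, double repeat) {
        long innerCount = (long) Math.ceil(valueCount * repeat);
        return (innerCount + 1) * OFFSET_WIDTH;
    }

    // Round a request up to the next power of two, modelling the
    // power-of-2 allocation mentioned in the description.
    static long roundUpToPowerOfTwo(long bytes) {
        if (bytes <= 1) {
            return 1;
        }
        return Long.highestOneBit(bytes - 1) << 1;
    }

    public static void main(String[] args) {
        int valueCount = 4095;
        // Fixed multiplier of 5, then a caller-chosen repeat factor of 1.
        for (double repeat : new double[] {5.0, 1.0}) {
            long requested = requestedBytes(valueCount, repeat);
            long allocated = roundUpToPowerOfTwo(requested);
            System.out.println("repeat=" + repeat
                + " requested=" + requested
                + " allocated=" + allocated);
        }
    }
}
```

With the multiplier of 5 the request is 81904 bytes and the allocation is 131072 bytes (128KB). With a repeat factor of 1 the request is 16384 bytes and needs no rounding, which is the kind of control a caller-dictated repeatedness would give.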