[ https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Teotia updated ARROW-2019:
------------------------------------
    Component/s: Java - Vectors

> Control the memory allocated for inner vector in LIST
> -----------------------------------------------------
>
>                 Key: ARROW-2019
>                 URL: https://issues.apache.org/jira/browse/ARROW-2019
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java - Vectors
>            Reporter: Siddharth Teotia
>            Assignee: Siddharth Teotia
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> We have observed cases in our external sort code where the amount of memory
> actually allocated for a record batch turns out to be more than necessary,
> and also more than what was reserved by the operator for special purposes.
> As a result, queries fail with OOM.
> The usual way to control the memory allocated by vector.allocateNew() is to
> call setInitialCapacity() first; the latter modifies the vector state
> variables, which are then used to allocate memory. However, due to the
> multiplier of 5 used in ListVector, we end up asking for more memory than
> necessary. For example, for a value count of 4095, we asked for 128KB of
> memory for the offset buffer of a VarCharVector, for a field that was a list
> of varchars.
> The computation was ((4095 * 5) + 1) * 4 = 81904 bytes (about 80KB), which
> is rounded up to 128KB because allocations are rounded to a power of 2.
> We had earlier made changes to setInitialCapacity() of ListVector when we
> were facing problems with deeply nested lists, and decided to use the
> multiplier only for the leaf scalar vector.
> It looks like there is a need for a specialized setInitialCapacity() for
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity(): the allocation of the
> validity buffer doesn't obey the capacity specified in setInitialCapacity().

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
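
The offset-buffer arithmetic quoted in the issue can be reproduced with a small standalone sketch. This is not Arrow code: the class name OffsetBufferSizing and both helper methods are hypothetical and exist only to illustrate the numbers. It assumes 4-byte offset entries and power-of-2 rounding of the requested size, as the issue text describes.

```java
public final class OffsetBufferSizing {
    // Width of one offset entry in bytes, matching the "* 4" in the issue's arithmetic.
    static final int OFFSET_WIDTH = 4;

    /** Bytes requested for an offset buffer: (valueCount * multiplier + 1) * 4. */
    public static long offsetBufferBytes(int valueCount, int multiplier) {
        return ((long) valueCount * multiplier + 1) * OFFSET_WIDTH;
    }

    /** Rounds a positive size up to the next power of two (unchanged if already one). */
    public static long roundUpToPowerOfTwo(long size) {
        long highest = Long.highestOneBit(size);
        return highest == size ? size : highest << 1;
    }

    public static void main(String[] args) {
        // Case from the issue: value count 4095 with the ListVector multiplier of 5.
        long requested = offsetBufferBytes(4095, 5);
        long allocated = roundUpToPowerOfTwo(requested);
        System.out.println(requested + " bytes requested, " + allocated + " bytes allocated");

        // For comparison only: a repeat factor of 1, a hypothetical caller-supplied value.
        long requestedNoRepeat = offsetBufferBytes(4095, 1);
        System.out.println(requestedNoRepeat + " bytes requested, "
                + roundUpToPowerOfTwo(requestedNoRepeat) + " bytes allocated");
    }
}
```

With the multiplier of 5 the request is 81904 bytes, which rounds up to 131072 bytes (128KB). With a repeat factor of 1 the request is 16384 bytes, which is already a power of 2, so nothing is added by rounding. The gap between the two is the kind of over-allocation the issue proposes letting the caller control.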