[
https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336905#comment-16336905
]
ASF GitHub Bot commented on ARROW-2019:
---------------------------------------
siddharthteotia commented on issue #1497: ARROW-2019: [JAVA] Control the memory
allocated for inner vector in LIST
URL: https://github.com/apache/arrow/pull/1497#issuecomment-360021601
@jacques-n, in addition to the new API setInitialCapacity(valueCount,
multiplier), this patch fixes another bug.
Until now, setInitialCapacity() was implemented only in the superclass
BaseRepeatedValueVector, so calling it did not control the allocation of the
validity buffer, which belongs to the subclass ListVector. The call affected
only the offset buffer and the data vector, since they are members of the
superclass.
So if we did setInitialCapacity(512), the validity buffer would still be
allocated with its default state, i.e., for 4096 values. getValueCapacity()
then takes MIN(offset buffer value capacity - 1, validity buffer capacity),
which is why the result was 1023:
we allocated the offset buffer for (512 + 1) * 4 => 2052 bytes => 4096 bytes
(rounded up to a power of 2), which gives the offset buffer a value capacity
of 1024. So the result was 1023.
In other words, getValueCapacity() was previously returning the value
capacity of the offset buffer. Now that both versions of setInitialCapacity()
are implemented in the base class, it correctly reflects the value capacity
with respect to the validity buffer.
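
The arithmetic above can be checked with a small standalone sketch. It does not use the Arrow classes: the class and method names are hypothetical, and it assumes 4-byte offsets, one validity bit per value, and buffer sizes rounded up to the next power of 2, as described in this comment.

```java
// Hypothetical model of the capacity arithmetic; not the Arrow implementation.
final class ListCapacitySketch {
    static final int OFFSET_WIDTH = 4; // assumed bytes per offset entry

    // Round n up to the next power of 2 (models the allocator's rounding).
    public static long nextPowerOfTwo(long n) {
        if (n <= 1) {
            return 1;
        }
        long highest = Long.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    // Values addressable by an offset buffer sized for valueCount values:
    // (valueCount + 1) offsets are requested, and one slot is the end marker.
    public static long offsetValueCapacity(int valueCount) {
        long bytes = nextPowerOfTwo((long) (valueCount + 1) * OFFSET_WIDTH);
        return bytes / OFFSET_WIDTH - 1;
    }

    // Values addressable by a validity buffer sized for valueCount values.
    public static long validityValueCapacity(int valueCount) {
        long bytes = nextPowerOfTwo((valueCount + 7) / 8);
        return bytes * 8;
    }

    // MIN of the two capacities, as getValueCapacity() is described above.
    public static long valueCapacity(int offsetCount, int validityCount) {
        return Math.min(offsetValueCapacity(offsetCount),
                        validityValueCapacity(validityCount));
    }

    public static void main(String[] args) {
        // Old behavior: validity buffer stays at the default of 4096 values.
        System.out.println("old: " + valueCapacity(512, 4096));
        // Behavior once the validity buffer also honors the requested 512.
        System.out.println("new: " + valueCapacity(512, 512));
    }
}
```

Under these assumptions the first call models the old behavior (validity buffer left at 4096 values) and the second models the case where both buffers honor the requested capacity of 512.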
> Control the memory allocated for inner vector in LIST
> -----------------------------------------------------
>
> Key: ARROW-2019
> URL: https://issues.apache.org/jira/browse/ARROW-2019
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Siddharth Teotia
> Assignee: Siddharth Teotia
> Priority: Critical
> Labels: pull-request-available
>
> We have observed cases in our external sort code where the amount of memory
> actually allocated for a record batch sometimes turns out to be more than
> necessary and also more than what was reserved by the operator for special
> purposes. Thus queries fail with OOM.
> The usual way to control the memory allocated by vector.allocateNew() is to
> call setInitialCapacity(), which modifies the vector state variables that
> are then used to allocate memory. However, due to the multiplier of 5 used in
> ListVector, we end up asking for more memory than necessary. For example,
> for a value count of 4095, we asked for 128KB of memory for the offset buffer
> of the VarCharVector for a field that was a list of varchars.
> We did ((4095 * 5) + 1) * 4 => 80KB => 128KB (rounded up to a power of 2
> allocation).
> We had earlier made changes to setInitialCapacity() of ListVector when we
> were facing problems with deeply nested lists and decided to use the
> multiplier only for the leaf scalar vector.
> It looks like there is a need for a specialized setInitialCapacity() for
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity() where the allocation of
> validity buffer doesn't obey the capacity specified in setInitialCapacity().
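
The effect of the multiplier on the inner offset buffer can be reproduced with a short standalone sketch. It does not call Arrow: the class and method names are hypothetical, the multiplier is modeled as an int, and the sketch assumes 4-byte offsets and power-of-2 rounding as in the description above.

```java
// Hypothetical model of the inner offset buffer sizing; not the Arrow code.
final class InnerOffsetSketch {
    static final int OFFSET_WIDTH = 4; // assumed bytes per offset entry

    // Round n up to the next power of 2 (models the allocator's rounding).
    public static long nextPowerOfTwo(long n) {
        if (n <= 1) {
            return 1;
        }
        long highest = Long.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    // Bytes requested for the inner vector's offset buffer:
    // ((valueCount * multiplier) + 1) offsets of OFFSET_WIDTH bytes each.
    public static long requestedBytes(int valueCount, int multiplier) {
        return ((long) valueCount * multiplier + 1) * OFFSET_WIDTH;
    }

    // Bytes actually allocated after rounding up to a power of 2.
    public static long allocatedBytes(int valueCount, int multiplier) {
        return nextPowerOfTwo(requestedBytes(valueCount, multiplier));
    }

    public static void main(String[] args) {
        // Fixed multiplier of 5, as in the description.
        System.out.println(requestedBytes(4095, 5) + " -> " + allocatedBytes(4095, 5));
        // A caller-chosen multiplier of 1 (illustrative value only).
        System.out.println(requestedBytes(4095, 1) + " -> " + allocatedBytes(4095, 1));
    }
}
```

With the fixed multiplier of 5 the request is 81904 bytes (about 80KB), which rounds up to 131072 bytes (128KB). A caller that knows the lists average one element each could pass a multiplier of 1 and request only 16384 bytes (16KB) for the same value count.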
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)