[ 
https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338424#comment-16338424
 ] 

ASF GitHub Bot commented on ARROW-2019:
---------------------------------------

jacques-n commented on a change in pull request #1497: ARROW-2019: [JAVA] Control the memory allocated for inner vector in LIST
URL: https://github.com/apache/arrow/pull/1497#discussion_r163711431
 
 

 ##########
 File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
 ##########
 @@ -102,6 +97,60 @@ public void initializeChildrenFromFields(List<Field> children) {
     addOrGetVector.getVector().initializeChildrenFromFields(field.getChildren());
   }
 
+  @Override
+  public void setInitialCapacity(int numRecords) {
+    validityAllocationSizeInBytes = getValidityBufferSizeFromCount(numRecords);
+    super.setInitialCapacity(numRecords);
+  }
+
+  /**
+   * Specialized version of setInitialCapacity() for ListVector. This is
+   * used by some callers when they want to explicitly control and be
+   * conservative about memory allocated for inner data vector. This is
+   * very useful when we are working with memory constraints for a query
+   * and have a fixed amount of memory reserved for the record batch. In
+   * such cases, we are likely to face OOM or related problems when
+   * we reserve memory for a record batch with value count x and
+   * do setInitialCapacity(x) such that each vector allocates only
+   * what is necessary and not the default amount but the multiplier
+   * forces the memory requirement to go beyond what was needed.
+   *
+   * @param numRecords value count
+   * @param density density of ListVector. Density is the average size of
+   *                list per position in the List vector. For example, a
+   *                density value of 10 implies each position in the list
+   *                vector has a list of 10 values.
+   *                A density value of 0.1 implies out of 10 positions in
+   *                the list vector, 1 position has a list of size 1 and
+   *                remaining positions are null (no lists). This helps
 
 Review comment:
   null (no lists) or empty lists
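
 To illustrate the point of this comment, here is a minimal sketch of how a
 density value could be derived from actual data. The class ListDensity and
 the method density() are hypothetical names used only for illustration; they
 are not part of ListVector or the Arrow API. The sketch assumes density means
 "total inner values divided by number of positions", as the Javadoc above
 describes, so a null position and an empty list both contribute zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Arrow code: computes the average list size per position.
class ListDensity {

  /** Returns total inner values / number of positions; null and empty lists both add 0. */
  static double density(List<List<Integer>> positions) {
    if (positions.isEmpty()) {
      return 0.0;
    }
    long totalInnerValues = 0;
    for (List<Integer> list : positions) {
      if (list != null) {
        totalInnerValues += list.size();
      }
    }
    return (double) totalInnerValues / positions.size();
  }

  public static void main(String[] args) {
    // 10 positions: one list of size 1, the rest a mix of null and empty lists.
    List<List<Integer>> positions = new ArrayList<>();
    positions.add(Arrays.asList(42));
    for (int i = 0; i < 9; i++) {
      positions.add(i % 2 == 0 ? null : new ArrayList<Integer>());
    }
    System.out.println(density(positions));
  }
}
```

 Under this assumption, a density of 0.1 describes the data equally well
 whether the remaining nine positions are null or hold empty lists.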

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Control the memory allocated for inner vector in LIST
> -----------------------------------------------------
>
>                 Key: ARROW-2019
>                 URL: https://issues.apache.org/jira/browse/ARROW-2019
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Siddharth Teotia
>            Assignee: Siddharth Teotia
>            Priority: Critical
>              Labels: pull-request-available
>
> We have observed cases in our external sort code where the amount of memory 
> actually allocated for a record batch turns out to be more than necessary, 
> and also more than what was reserved by the operator for special purposes. 
> As a result, queries fail with OOM.
> The usual way to control the memory allocated by vector.allocateNew() is to 
> call setInitialCapacity() first; it modifies the vector state variables that 
> are then used to allocate memory. However, due to the multiplier of 5 used in 
> ListVector, we end up asking for more memory than necessary. For example, 
> for a value count of 4095, we asked for 128KB of memory for the offset buffer 
> of the VarCharVector for a field that was a list of varchars: 
> ((4095 * 5) + 1) * 4 = 81,904 bytes (about 80KB), which is rounded up to 
> 128KB (the next power of 2). 
> We had earlier changed setInitialCapacity() of ListVector when we were 
> facing problems with deeply nested lists, and decided to use the multiplier 
> only for the leaf scalar vector. 
> It looks like there is a need for a specialized setInitialCapacity() for 
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity(): the allocation of the 
> validity buffer doesn't obey the capacity specified in setInitialCapacity(). 
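
The arithmetic in the description above can be reproduced with a small
standalone sketch. The class OffsetBufferSizing and its methods are
hypothetical names chosen for illustration and do not call into Arrow; the
sketch assumes 4-byte offsets, one extra offset entry, and rounding up to the
next power of 2, as stated in the issue. The caller-supplied density case is
an assumption about how a density-aware setInitialCapacity() might size the
inner vector, not a description of the final implementation.

```java
// Hypothetical helper, not Arrow code: reproduces the sizing arithmetic from the issue.
class OffsetBufferSizing {

  static final int OFFSET_WIDTH = 4; // bytes per offset entry (assumed from the issue)

  /** Bytes requested for an offset buffer holding (inner value count + 1) offsets. */
  static long offsetBufferBytes(int numRecords, double valuesPerRecord) {
    long innerValueCount = Math.max(1L, (long) Math.ceil(numRecords * valuesPerRecord));
    return (innerValueCount + 1) * OFFSET_WIDTH;
  }

  /** Rounds a positive size up to the next power of 2 (unchanged if already one). */
  static long roundUpToPowerOfTwo(long size) {
    long highest = Long.highestOneBit(size);
    return highest == size ? size : highest << 1;
  }

  public static void main(String[] args) {
    int numRecords = 4095;

    // Fixed multiplier of 5, as described in the issue.
    long requested = offsetBufferBytes(numRecords, 5.0);
    System.out.println(requested + " bytes requested, "
        + roundUpToPowerOfTwo(requested) + " bytes allocated");

    // Caller-supplied density of 1.0 (one inner value per record), for comparison.
    long withDensity = offsetBufferBytes(numRecords, 1.0);
    System.out.println(withDensity + " bytes requested, "
        + roundUpToPowerOfTwo(withDensity) + " bytes allocated");
  }
}
```

With the multiplier of 5 the request is 81,904 bytes and the allocation is
131,072 bytes (128KB); with a density of 1.0 the request is exactly 16,384
bytes (16KB), so no rounding overhead is added.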



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
