[ 
https://issues.apache.org/jira/browse/ARROW-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-11739:
---------------------------------
    Description: 
Following the discussion on https://github.com/apache/arrow/pull/9187.

Proposed API in BaseVariableWidthVector.java:

{code:java}
/**
   * Get the potential buffer size for a particular number of records and density.
   * @param valueCount desired number of elements in the vector
   * @param density average number of bytes per variable width element
   * @return estimated size of underlying buffers if the vector holds
   *         a given number of elements
   */
public int getBufferSizeFor(final int valueCount, double density)
{code}

The current `getBufferSizeFor(int valueCount)` for BaseVariableWidthVector 
requires that the validity and offset buffers have already been allocated for 
at least the given `valueCount`. If the aim of this method is to estimate 
memory usage for a value count, it is not very useful, because it can only 
report sizes for value counts less than or equal to what the currently 
allocated vector can hold.

A better approach for approximating memory usage is to take a density 
argument along with the value count. The buffer estimate then does not require 
the validity and offset buffers to have any allocation. This is also in line 
with `setInitialCapacity(int valueCount, double density)`.
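
As a rough illustration of the density-based estimate described above, here is a minimal standalone sketch. It does not use the Arrow classes and is not the proposed implementation. The class name `BufferSizeEstimate` and its helper methods are hypothetical, and the sketch assumes 4-byte offsets, one validity bit per value, and no padding or alignment of the buffers. It returns a `long` rather than the `int` in the proposed signature, only to sidestep overflow in the example.

```java
// Hypothetical sketch, not Arrow code: estimates the combined size of the
// validity, offset and data buffers of a variable-width vector.
final class BufferSizeEstimate {

    // Assumption: 4-byte offsets, as in the non-"Large" variable-width vectors.
    static final int OFFSET_WIDTH = 4;

    private BufferSizeEstimate() {}

    // One validity bit per value, rounded up to whole bytes.
    static long validityBufferSize(int valueCount) {
        return (valueCount + 7L) / 8L;
    }

    // One offset per value plus one trailing offset marking the end of the data.
    static long offsetBufferSize(int valueCount) {
        return (valueCount + 1L) * OFFSET_WIDTH;
    }

    // Density is the average number of bytes per element.
    static long dataBufferSize(int valueCount, double density) {
        return (long) Math.ceil(valueCount * density);
    }

    // Needs no allocated vector: the estimate depends only on its arguments.
    static long estimateBufferSize(int valueCount, double density) {
        if (valueCount < 0 || density < 0) {
            throw new IllegalArgumentException("valueCount and density must be non-negative");
        }
        if (valueCount == 0) {
            return 0L;
        }
        return validityBufferSize(valueCount)
            + offsetBufferSize(valueCount)
            + dataBufferSize(valueCount, density);
    }
}
```

For example, under these assumptions 100 values at an average of 8 bytes each come to 13 validity bytes, 404 offset bytes and 800 data bytes, so `BufferSizeEstimate.estimateBufferSize(100, 8.0)` returns 1217.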

NOTE: this API should also be added to BaseLargeVariableWidthVector and 
possibly BaseRepeatedValueVector(Large) as well.

  was:
Following the discussion on https://github.com/apache/arrow/pull/9187.

The current `getBufferSize(int valueCount)` for BaseVariableWidthVector 
requires that validity and offset vectors have already been allocated for at 
least the given `valueCount`. If the aim of this method is to estimate the 


> [Java] Add API for getBufferSizeFor() with density to BaseVariableWidthVector
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-11739
>                 URL: https://issues.apache.org/jira/browse/ARROW-11739
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Bryan Cutler
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
