BryanCutler commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-766152565


   I think the intention of `getBufferSizeFor(final int valueCount)` is to 
provide an estimate of the vector's buffer size, and it doesn't make sense to 
require the vector to be in a particular state to get that estimate. Even 
calling `setValueCount()` first doesn't give a good estimate, since that just 
fills the remaining slots with empty values. And because this is a 
variable-width vector, it doesn't make sense to derive the estimate from 
`valueCount` alone: the data buffer size depends on how long the values are, 
not only on how many there are.
   
   A better way to estimate the buffer size would be to add a `density` 
parameter for the average number of bytes per value, similar to 
`setInitialCapacity(int valueCount, double density)`. You could then take the 
density from a previous vector and use it to estimate the size of the next 
vector:
   
   ```java
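   int batch_size = 123;
   // Average number of data bytes per value in the previous vector
   double prev_density = prev_vector.getDensity();
   // Proposed overload that takes a density in addition to the value count
   int estimated_size = new_vector.getBufferSizeFor(batch_size, prev_density);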
   ```
   
   This new `getBufferSizeFor()` overload would not depend on the vector's 
state, and `setValueCount()` would not need to be called beforehand. What do 
you all think, and does that work for your use case @WeichenXu123?
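   To make the arithmetic concrete, here is a rough, standalone sketch of what 
a density-based estimate could compute. It is not Arrow's implementation: the 
class `DensityEstimate` and its methods are hypothetical names, and it assumes 
a validity bitmap of one bit per value, 4-byte offsets with one extra entry, 
and no padding or allocator rounding.

```java
// Hypothetical sketch: estimates the total buffer size of a variable-width
// vector from a value count and an average number of data bytes per value.
final class DensityEstimate {
    // Assumption: 32-bit offsets, as used by the regular (non-large) layouts.
    private static final int OFFSET_WIDTH = 4;

    // Average data bytes per value, e.g. measured on a previous batch.
    static double density(long dataBytes, int valueCount) {
        return valueCount == 0 ? 0.0 : (double) dataBytes / valueCount;
    }

    // Estimated size in bytes: validity bitmap + offsets + data.
    // Padding and power-of-two rounding by the allocator are ignored.
    static long estimateBufferSize(int valueCount, double density) {
        if (valueCount < 0 || density < 0) {
            throw new IllegalArgumentException("valueCount and density must be non-negative");
        }
        if (valueCount == 0) {
            return 0;
        }
        long validityBytes = (valueCount + 7L) / 8;          // one bit per value, rounded up
        long offsetBytes = (valueCount + 1L) * OFFSET_WIDTH; // n values need n + 1 offsets
        long dataBytes = (long) Math.ceil(valueCount * density);
        return validityBytes + offsetBytes + dataBytes;
    }

    public static void main(String[] args) {
        // Previous batch: 123 values holding 1230 data bytes, so density is 10.0.
        double prevDensity = density(1230, 123);
        System.out.println(estimateBufferSize(123, prevDensity)); // prints 1742
    }
}
```

Nothing here reads any vector state, which is the point of the proposal: the 
estimate depends only on the two arguments.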

