hrishisd opened a new issue, #37829:
URL: https://github.com/apache/arrow/issues/37829

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ### Arrow version
   Built from main branch as of 9/21/2023
   
   ### Problem description
   
   When one variable-length vector is appended to another, `VectorAppender`
   repeatedly calls `reAlloc()` on the target vector until its validity and
   offset buffers can hold the combined elements. Each `reAlloc()` call also
   grows the data buffer, so the data buffer can exceed the max allocation
   limit when a large number of small elements is appended to a vector that
   holds a single large element.
   ```java
   // Body of VectorAppender::visit
   
   // make sure there is enough capacity
   while (targetVector.getValueCapacity() < newValueCount) {
     targetVector.reAlloc(); // should only realloc validity and offset buffers.
   }
   while (targetVector.getDataBuffer().capacity() < newValueCapacity) {
     ((BaseVariableWidthVector) targetVector).reallocDataBuffer();
   }
   
   ```
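
   The effect of the first loop can be illustrated with a toy model that does
   not use Arrow at all. This is only a sketch: the class and method names are
   hypothetical, it assumes each reallocation doubles a capacity, and the
   starting slot capacity of 8 is an assumption rather than a value taken from
   Arrow. Unlike Arrow, the model does not throw at the limit; it only reports
   how large the data buffer would become.

   ```java
   /**
    * Toy model of the capacity loops described in this issue. It does not use
    * Arrow; all names and starting capacities here are hypothetical.
    */
   public class ReallocSketch {
     // 1 MiB, the arrow.vector.max_allocation_bytes value used in the report.
     static final long MAX_ALLOCATION_BYTES = 1L << 20;

     /** Coupled growth: each doubling of the slot capacity also doubles the data buffer. */
     static long coupledDataCapacity(long valueCapacity, long dataCapacity, long neededValues) {
       while (valueCapacity < neededValues) {
         valueCapacity *= 2;
         dataCapacity *= 2;
       }
       return dataCapacity;
     }

     /** Decoupled growth: the data buffer doubles only until it can hold the needed bytes. */
     static long decoupledDataCapacity(long dataCapacity, long neededBytes) {
       while (dataCapacity < neededBytes) {
         dataCapacity *= 2;
       }
       return dataCapacity;
     }

     public static void main(String[] args) {
       long dataCapacity = 256 * 1024;        // one 256 KiB string already stored
       long valueCapacity = 8;                // assumed starting slot capacity
       long neededValues = 1 + 1024;          // existing element plus 1024 appended
       long neededBytes = 256 * 1024 + 1024;  // existing bytes plus 1 KiB appended

       long coupled = coupledDataCapacity(valueCapacity, dataCapacity, neededValues);
       long decoupled = decoupledDataCapacity(dataCapacity, neededBytes);
       System.out.println("coupled:   " + coupled + " bytes, over limit: "
           + (coupled > MAX_ALLOCATION_BYTES));
       System.out.println("decoupled: " + decoupled + " bytes, over limit: "
           + (decoupled > MAX_ALLOCATION_BYTES));
     }
   }
   ```

   Under these assumptions the coupled loop drives the data buffer far past
   1 MiB, while growing the data buffer on its own stops at 512 KiB, which is
   why the first loop arguably should touch only the validity and offset
   buffers.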
   
   ### Steps to reproduce
   The error can be reproduced using the snippet below. 
   
   ```java
   @Test
   public void testResizingBug() {
     var allocator = new RootAllocator();
     System.err.println("max allocation size: "
         + BaseValueVector.MAX_ALLOCATION_SIZE);
     // arrow.vector.max_allocation_bytes is set to 1048576 (1 MiB)
     // create a vector with a single 256 KiB string
     VarCharVector target = makeVec(1, 256 * 1024, allocator);
     // create a vector of 1024 one-byte strings (1 KiB in total)
     VarCharVector delta = makeVec(1024, 1, allocator);
     // we should be able to fit all the strings into a single vector using
     // less than 1 MiB.
     // this works
     new VectorAppender(delta).visit(target, null);
     // this fails
     new VectorAppender(target).visit(delta, null);
   }
   
   private static VarCharVector makeVec(
       int nElements, int bytesPerElement, BufferAllocator allocator) {
     var v = new VarCharVector("test", allocator);
     v.allocateNew(nElements);
     for (int i = 0; i < nElements; i++) {
       v.setSafe(i, "A".repeat(bytesPerElement).getBytes(StandardCharsets.UTF_8));
     }
     v.setValueCount(nElements);
     return v;
   }
   ```
   
   The example produces the following error:
   ```
   org.apache.arrow.vector.util.OversizedAllocationException: Memory required for vector is (2097152), which is overflow or more than max allowed (1048576). You could consider using LargeVarCharVector/LargeVarBinaryVector for large strings/large bytes types
   
        at org.apache.arrow.vector.BaseVariableWidthVector.checkDataBufferSize(BaseVariableWidthVector.java:435)
        at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:542)
        at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:520)
        at org.apache.arrow.vector.BaseVariableWidthVector.reAlloc(BaseVariableWidthVector.java:497)
        at org.apache.arrow.vector.util.VectorAppender.visit(VectorAppender.java:119)
        at org.apache.arrow.vector.util.TestVectorSchemaRootAppender.testResizingBug(TestVectorSchemaRootAppender.java:68)
   ```
   It looks like the issue is also present when appending other variable-length
   vector types.
   
   ---
   
   I'm happy to post a PR if this looks reasonable. 
   
   ### Component(s)
   
   Java

