hrishisd opened a new issue, #37829:
URL: https://github.com/apache/arrow/issues/37829
### Describe the bug, including details regarding any error messages, version, and platform.
### Arrow version
Built from main branch as of 9/21/2023
### Problem description
When appending two variable-length vectors, `VectorAppender` repeatedly
resizes the validity and offset buffers of the target vector until they can
hold the combined elements. Each of those resizes also grows the data buffer,
so the data buffer can exceed the max allocation limit when we append a large
number of small elements to a vector that holds a single large element.
```java
// Body of VectorAppender::visit
// make sure there is enough capacity
while (targetVector.getValueCapacity() < newValueCount) {
  targetVector.reAlloc(); // should only realloc validity and offset buffers.
}
while (targetVector.getDataBuffer().capacity() < newValueCapacity) {
  ((BaseVariableWidthVector) targetVector).reallocDataBuffer();
}
```
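To make the arithmetic concrete, here is a minimal plain-Java simulation of that first loop. It does not use Arrow at all: `CoupledGrowthSketch`, `growCoupled`, and `OversizedException` are hypothetical names, the starting value capacity of 8 is an assumed figure, and the model simply assumes that every resize doubles both the value capacity and the data buffer.

```java
final class CoupledGrowthSketch {
    /** Raised when the simulated data buffer would exceed the allocation limit. */
    public static final class OversizedException extends RuntimeException {
        public final long requestedBytes;

        public OversizedException(long requestedBytes, long maxBytes) {
            super("requested " + requestedBytes + " bytes, max allowed " + maxBytes);
            this.requestedBytes = requestedBytes;
        }
    }

    /**
     * Doubles the value capacity until it reaches neededValues. Each doubling
     * also doubles the data buffer, which is the coupling described above.
     * Returns the final simulated data buffer size in bytes.
     */
    public static long growCoupled(long valueCapacity, long dataBytes,
                                   long neededValues, long maxBytes) {
        if (valueCapacity <= 0 || dataBytes <= 0) {
            throw new IllegalArgumentException("capacities must be positive");
        }
        while (valueCapacity < neededValues) {
            long requested = dataBytes * 2;
            if (requested > maxBytes) {
                throw new OversizedException(requested, maxBytes);
            }
            dataBytes = requested;
            valueCapacity *= 2;
        }
        return dataBytes;
    }

    public static void main(String[] args) {
        long maxBytes = 1024 * 1024; // 1 MiB limit
        try {
            // One 256 KiB value already stored, room needed for 1 + 1024 values.
            growCoupled(8, 256 * 1024, 1 + 1024, maxBytes);
        } catch (OversizedException e) {
            System.out.println(e.getMessage());
            // prints "requested 2097152 bytes, max allowed 1048576"
        }
    }
}
```

Under these assumptions the data buffer reaches the limit after two doublings, while the value capacity is still only 32, so the third resize fails even though the combined data would need only slightly more than 256 KiB.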
### Steps to reproduce
The error can be reproduced using the snippet below.
```java
@Test
public void testResizingBug() {
  var allocator = new RootAllocator();
  System.err.println("max allocation size: " + BaseValueVector.MAX_ALLOCATION_SIZE);
  // arrow.vector.max_allocation_bytes is set to 1048576 (1 MiB)

  // create a vector with a single 256 KiB string
  VarCharVector target = makeVec(1, 256 * 1024, allocator);
  // create a vector with a total of 1 KiB
  VarCharVector delta = makeVec(1024, 1, allocator);

  // we should be able to fit all the strings into a single vector using
  // less than 1 MiB.
  // this works
  new VectorAppender(delta).visit(target, null);
  // this fails
  new VectorAppender(target).visit(delta, null);
}

private static VarCharVector makeVec(int nElements, int bytesPerElement,
    BufferAllocator allocator) {
  var v = new VarCharVector("test", allocator);
  v.allocateNew(nElements);
  for (int i = 0; i < nElements; i++) {
    v.setSafe(i, "A".repeat(bytesPerElement).getBytes(StandardCharsets.UTF_8));
  }
  v.setValueCount(nElements);
  return v;
}
```
The example produces the following error:
```
org.apache.arrow.vector.util.OversizedAllocationException: Memory required for vector is (2097152), which is overflow or more than max allowed (1048576). You could consider using LargeVarCharVector/LargeVarBinaryVector for large strings/large bytes types
    at org.apache.arrow.vector.BaseVariableWidthVector.checkDataBufferSize(BaseVariableWidthVector.java:435)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:542)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:520)
    at org.apache.arrow.vector.BaseVariableWidthVector.reAlloc(BaseVariableWidthVector.java:497)
    at org.apache.arrow.vector.util.VectorAppender.visit(VectorAppender.java:119)
    at org.apache.arrow.vector.util.TestVectorSchemaRootAppender.testResizingBug(TestVectorSchemaRootAppender.java:68)
```
It looks like the issue is also present when appending other variable-length
vector types.
---
I'm happy to post a PR if this looks reasonable.
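As a rough illustration of the direction such a change could take, the plain-Java sketch below grows the value capacity and the data buffer independently, each driven only by its own requirement. It is a simulation, not Arrow code: `DecoupledGrowthSketch`, `growByDoubling`, and `growDecoupled` are hypothetical names, the starting value capacity of 8 is assumed, and doubling is assumed as the growth policy for both dimensions.

```java
final class DecoupledGrowthSketch {
    /** Smallest capacity reachable from current by repeated doubling that is >= needed. */
    public static long growByDoubling(long current, long needed) {
        if (current <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        while (current < needed) {
            current *= 2;
        }
        return current;
    }

    /**
     * Grows the value capacity (validity and offset buffers) and the data
     * buffer separately. Returns {valueCapacity, dataBytes}, or throws if the
     * data buffer itself would exceed the limit.
     */
    public static long[] growDecoupled(long valueCapacity, long dataBytes,
                                       long neededValues, long neededBytes,
                                       long maxBytes) {
        long newValueCapacity = growByDoubling(valueCapacity, neededValues);
        long newDataBytes = growByDoubling(dataBytes, neededBytes);
        if (newDataBytes > maxBytes) {
            throw new IllegalStateException(
                "requested " + newDataBytes + " bytes, max allowed " + maxBytes);
        }
        return new long[] {newValueCapacity, newDataBytes};
    }

    public static void main(String[] args) {
        // One 256 KiB value plus 1024 one-byte values, with a 1 MiB limit.
        long[] result = growDecoupled(
            8, 256 * 1024, 1 + 1024, 256 * 1024 + 1024, 1024 * 1024);
        System.out.println("value capacity " + result[0] + ", data bytes " + result[1]);
        // prints "value capacity 2048, data bytes 524288"
    }
}
```

With the two dimensions separated, the data buffer doubles only once in this scenario and stays well under the 1 MiB limit, whatever the number of doublings the value capacity needs.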
### Component(s)
Java