[
https://issues.apache.org/jira/browse/ARROW-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209303#comment-17209303
]
Micah Kornfield commented on ARROW-10153:
-----------------------------------------
The main implications of using LargeVarCharVector by default are:
1. It has an 8-byte instead of a 4-byte offset overhead per string value.
2. It might not be supported in all Arrow implementations (I would need to
double check the matrix/integration tests).
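
To put rough numbers on point 1: in the Arrow columnar format a variable-width
vector stores one offset per value plus one trailing offset, 4 bytes each with
32-bit offsets and 8 bytes each with 64-bit offsets. A back-of-the-envelope
sketch follows (the class and method names are made up for illustration and are
not Arrow API; real allocations may also be rounded up by the allocator):

```java
// Hypothetical helper, not part of Arrow: estimates the logical size of the
// offset buffer for a variable-width vector holding valueCount values.
final class OffsetOverheadSketch {

    // The format stores valueCount + 1 offsets, each offsetWidthBytes wide.
    static long offsetBufferBytes(int valueCount, int offsetWidthBytes) {
        return ((long) valueCount + 1) * offsetWidthBytes;
    }

    public static void main(String[] args) {
        int valueCount = 3000; // same count as the test case in the issue
        long small = offsetBufferBytes(valueCount, 4); // 32-bit offsets
        long large = offsetBufferBytes(valueCount, 8); // 64-bit offsets
        System.out.println("32-bit offsets: " + small + " bytes"); // 12004
        System.out.println("64-bit offsets: " + large + " bytes"); // 24008
    }
}
```

For a few thousand large strings the extra offset bytes are negligible; the
cost matters mostly for vectors with many short values.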
There isn't anything built into Java that will do the conversion automatically.
You could probably detect when a vector is getting close to the limit yourself
via accessors on the vectors (I think getByteCapacity, etc). You would, however,
potentially run into other problems trying to copy values that are close to 2GB
in size from one vector to another (you would have a pretty high peak off-heap
memory usage).
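
For reference, the arithmetic behind the failure in the test case quoted below
can be reproduced without Arrow. With 1 MiB values, the 2049th value would start
at byte 2^31, which does not fit in a signed 32-bit offset and wraps to
-2147483648, the same index reported in the exception. The sketch below is only
an illustration of that arithmetic and of a manual pre-check one could do before
appending; the class and method names are hypothetical and it does not call any
Arrow API:

```java
// Hypothetical illustration, not Arrow code: shows how a 32-bit start offset
// wraps for 1 MiB values and how a 64-bit computation avoids the wrap.
final class OffsetOverflowSketch {

    static final int VALUE_SIZE = 1024 * 1024; // 1 MiB per value, as in the issue

    // Start offset computed in 32-bit arithmetic; silently wraps on overflow.
    static int startOffset32(int index) {
        return index * VALUE_SIZE;
    }

    // Same computation widened to 64 bits before multiplying.
    static long startOffset64(int index) {
        return (long) index * VALUE_SIZE;
    }

    // Manual pre-check: would the end offset after appending still fit in int32?
    static boolean fitsInt32Offsets(long currentBytes, long incomingBytes) {
        return currentBytes + incomingBytes <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(startOffset32(2047)); // 2146435072, still in range
        System.out.println(startOffset32(2048)); // -2147483648, wrapped
        System.out.println(startOffset64(2048)); // 2147483648
        System.out.println(fitsInt32Offsets(startOffset64(2046), VALUE_SIZE)); // true
        System.out.println(fitsInt32Offsets(startOffset64(2047), VALUE_SIZE)); // false
    }
}
```

A check along the lines of fitsInt32Offsets, fed from whatever size accessor the
vector exposes, is one way to decide when to start a new vector or switch to
64-bit offsets before hitting the exception.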
> [Java] Adding values to VarCharVector beyond 2GB results in
> IndexOutOfBoundsException
> -------------------------------------------------------------------------------------
>
> Key: ARROW-10153
> URL: https://issues.apache.org/jira/browse/ARROW-10153
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Affects Versions: 1.0.0
> Reporter: Samarth Jain
> Priority: Major
>
> On executing the below test case, one can see that on adding the 2049th
> string of size 1MB, it fails.
> {code:java}
> int length = 1024 * 1024;
> StringBuilder sb = new StringBuilder(length);
> for (int i = 0; i < length; i++) {
>   sb.append("a");
> }
> byte[] str = sb.toString().getBytes();
> VarCharVector vector = new VarCharVector("v", new RootAllocator(Long.MAX_VALUE));
> vector.allocateNew(3000);
> for (int i = 0; i < 3000; i++) {
>   vector.setSafe(i, str);
> }
> {code}
>
> {code:java}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: -2147483648, length: 1048576 (expected: range(0, 2147483648))
>   at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
>   at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:762)
>   at org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1212)
>   at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1011)
> {code}
> Stepping through the code,
>
> [https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/ArrowBuf.java#L425]
> returns the negative index `-2147483648`
--
This message was sent by Atlassian Jira
(v8.3.4#803005)