[
https://issues.apache.org/jira/browse/ARROW-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Farmer reassigned ARROW-17338:
-----------------------------------
Assignee: Todd Farmer
> [Java] The maximum request memory of BaseVariableWidthVector should be limited to
> Integer.MAX_VALUE
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-17338
> URL: https://issues.apache.org/jira/browse/ARROW-17338
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Xianyang Liu
> Assignee: Todd Farmer
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2h
> Remaining Estimate: 0h
>
> We got an `IndexOutOfBoundsException`:
> ```
> 2022-08-03 09:33:34,076 Error executing query, currentState RUNNING,
> java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): java.lang.IndexOutOfBoundsException: index: 2147312542, length: 777713 (expected: range(0, 2147483648))
>     at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
>     at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826)
>     at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418)
>     at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235)
>     at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353)
>     at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161)
>     at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191)
>     at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74)
>     at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158)
>     at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51)
>     at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35)
>     at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
>     at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98)
>     at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
>     at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> ```
> The root cause is the following code in `BaseVariableWidthVector.handleSafe`:
> the capacity check can fail to trigger a reallocation because `startOffset +
> dataLength` overflows `int`, which then leads to an `IndexOutOfBoundsException`
> when the data is written into the vector.
> ```java
> protected final void handleSafe(int index, int dataLength) {
>   while (index >= getValueCapacity()) {
>     reallocValidityAndOffsetBuffers();
>   }
>   final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
>   // startOffset + dataLength is evaluated in int arithmetic and can overflow
>   // to a negative value, in which case the condition below is never true and
>   // no reallocation happens
>   while (valueBuffer.capacity() < (startOffset + dataLength)) {
>     reallocDataBuffer();
>   }
> }
> ```
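> To make the overflow concrete, here is a minimal standalone snippet (the class
> name `OverflowDemo` is just for illustration; the numbers are copied from the
> stack trace above) showing why the check never fires:
> ```java
> public class OverflowDemo {
>   public static void main(String[] args) {
>     int startOffset = 2147312542;  // "index" from the stack trace
>     int dataLength = 777713;       // "length" from the stack trace
>     long capacity = 2147483648L;   // "expected: range(0, 2147483648)"
>
>     // int addition wraps: 2147312542 + 777713 = 2148090255, which does not
>     // fit in 32 bits and becomes a large negative number
>     System.out.println(startOffset + dataLength);               // -2146877041
>     // so the capacity check in handleSafe is trivially satisfied
>     System.out.println(capacity < (startOffset + dataLength));  // false -> no realloc
>     // whereas long arithmetic shows the buffer really is too small
>     System.out.println(capacity < ((long) startOffset + dataLength));  // true
>   }
> }
> ```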
> The offset width of `BaseVariableWidthVector` is 4 bytes, so a valid offset can
> never exceed `Integer.MAX_VALUE`, yet the maximum memory allocation for a buffer
> is `Long.MAX_VALUE`. The `int`-based check above therefore stops protecting the
> write once the requested end offset passes `Integer.MAX_VALUE`, which is why the
> maximum request memory should be limited to `Integer.MAX_VALUE`.
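> One possible fix is to do the comparison in `long` and reject requests past
> `Integer.MAX_VALUE`. The sketch below is illustrative only, not the actual
> patch; it assumes Arrow's `org.apache.arrow.vector.util.OversizedAllocationException`
> for the error path, and the message text is made up:
> ```java
> protected final void handleSafe(int index, int dataLength) {
>   while (index >= getValueCapacity()) {
>     reallocValidityAndOffsetBuffers();
>   }
>   final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
>   // widen to long before adding so the sum cannot wrap around
>   final long targetCapacity = (long) startOffset + dataLength;
>   // with 4-byte offsets the data buffer can never usefully exceed Integer.MAX_VALUE
>   if (targetCapacity > Integer.MAX_VALUE) {
>     throw new OversizedAllocationException(
>         "Memory required for vector (" + targetCapacity
>             + ") is greater than max allowed (" + Integer.MAX_VALUE + ")");
>   }
>   while (valueBuffer.capacity() < targetCapacity) {
>     reallocDataBuffer();
>   }
> }
> ```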
--
This message was sent by Atlassian Jira
(v8.20.10#820010)