[
https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697999#comment-17697999
]
Apache Spark commented on SPARK-42715:
--------------------------------------
User 'chong0929' has created a pull request for this issue:
https://github.com/apache/spark/pull/40341
> NegativeArraySizeException by too many datas read from ORC file
> ---------------------------------------------------------------
>
> Key: SPARK-42715
> URL: https://issues.apache.org/jira/browse/SPARK-42715
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.2
> Reporter: XiaoLong Wu
> Priority: Minor
>
> Should we provide a friendlier exception message that explains how to avoid
> this exception? For example, when we catch it, we could tell the user to
> reduce the value of spark.sql.orc.columnarReaderBatchSize.
> In the current version, batch reading of ORC files goes through
> OrcColumnarBatchReader.nextBatch(), which depends on
> [ORC|https://github.com/apache/orc] (version 1.8.2) to copy the data. The
> relevant ORC code is as follows:
> {code:java}
> private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
>     LongColumnVector scratchlcv,
>     BytesColumnVector result, final int batchSize) throws IOException {
>   // Read lengths
>   scratchlcv.isRepeating = result.isRepeating;
>   scratchlcv.noNulls = result.noNulls;
>   scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull vector here...
>   lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);
>   int totalLength = 0;
>   if (!scratchlcv.isRepeating) {
>     for (int i = 0; i < batchSize; i++) {
>       if (!scratchlcv.isNull[i]) {
>         totalLength += (int) scratchlcv.vector[i];
>       }
>     }
>   } else {
>     if (!scratchlcv.isNull[0]) {
>       totalLength = (int) (batchSize * scratchlcv.vector[0]);
>     }
>   }
>   // Read all the strings for this batch
>   byte[] allBytes = new byte[totalLength];
>   int offset = 0;
>   int len = totalLength;
>   while (len > 0) {
>     int bytesRead = stream.read(allBytes, offset, len);
>     if (bytesRead < 0) {
>       throw new EOFException("Can't finish byte read from " + stream);
>     }
>     len -= bytesRead;
>     offset += bytesRead;
>   }
>   return allBytes;
> } {code}
> As shown above, totalLength is an int that accumulates the per-row lengths,
> each of which is a long cast to int. If the total data size exceeds
> Integer.MAX_VALUE, the sum can overflow to a negative value, and allocating
> the byte array then throws the following exception:
> {code:java}
> Caused by: java.lang.NegativeArraySizeException
> at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
> at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
> at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
> at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
> at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
> at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
> at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
> at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
> at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
> at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
> at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
> ... 20 more {code}
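
To make the overflow described in the issue concrete, here is a minimal standalone sketch. It is not ORC or Spark code: the class name OrcLengthOverflowSketch, both helper methods, the IllegalStateException, and the message wording are hypothetical and only illustrate the idea. It assumes a batch of 4096 rows (which I believe is the default for spark.sql.orc.columnarReaderBatchSize) where every value is 512 KiB, so the batch needs 4096 * 524288 = 2^31 bytes, one more than Integer.MAX_VALUE.

```java
import java.util.Arrays;

// Standalone illustration only; none of these names exist in ORC or Spark.
final class OrcLengthOverflowSketch {

    // Mirrors the accumulation in the quoted snippet: each long length is cast
    // to int and added to an int total, which wraps around silently.
    static int uncheckedTotal(long[] lengths) {
        int total = 0;
        for (long length : lengths) {
            total += (int) length;
        }
        return total;
    }

    // Hypothetical alternative: accumulate in a long and fail with a message
    // that points the user at the batch size setting.
    static int checkedTotal(long[] lengths) {
        long total = 0L;
        for (long length : lengths) {
            total += length;
        }
        if (total > Integer.MAX_VALUE) {
            throw new IllegalStateException(
                "Batch needs " + total + " bytes, which exceeds the maximum array size; "
                    + "consider reducing spark.sql.orc.columnarReaderBatchSize");
        }
        return (int) total;
    }

    public static void main(String[] args) {
        // 4096 rows of 512 KiB each: 2^12 * 2^19 = 2^31 bytes in total.
        long[] lengths = new long[4096];
        Arrays.fill(lengths, 512L * 1024L);

        // The int sum wraps to Integer.MIN_VALUE, so new byte[total] would
        // throw NegativeArraySizeException.
        System.out.println("unchecked total: " + uncheckedTotal(lengths));

        try {
            checkedTotal(lengths);
        } catch (IllegalStateException e) {
            System.out.println("checked total failed: " + e.getMessage());
        }
    }
}
```

Under the same assumptions, halving the batch size to 2048 rows brings the total down to 2^30 bytes, which fits in an int, so the kind of hint shown in the message would be actionable for the user.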