ghrahul commented on issue #9129:
URL: https://github.com/apache/hudi/issues/9129#issuecomment-1621830109
@codope, we are using Spark `v3.2.1`.
When we run a Spark job in our cluster, we sometimes hit this
`enableVectorizedReader` error:
```
The Spark application has failed. Reason: Job aborted due to stage failure:
Task 13 in stage 105.0 failed 4 times, most recent failure: Lost task 13.3 in
stage 105.0 (TID 31060) (10.60.5.71 executor 12): java.lang.RuntimeException:
Cannot reserve additional contiguous bytes in the vectorized reader (integer
overflow). As a workaround, you can reduce the vectorized reader batch size, or
disable the vectorized reader, or disable spark.sql.sources.bucketing.enabled
if you read from bucket table. For Parquet file format, refer to
spark.sql.parquet.columnarReaderBatchSize (default 4096) and
spark.sql.parquet.enableVectorizedReader; for ORC file format, refer to
spark.sql.orc.columnarReaderBatchSize (default 4096) and
spark.sql.orc.enableVectorizedReader.
	at org.apache.spark.sql.execution.vectorized.WritableColumnVector.throwUnsupportedException(WritableColumnVector.java:113)
	at org.apache.spark.sql.execution.vectorized.WritableColumnVector.reserve(WritableColumnVector.java:86)
	at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendBytes(WritableColumnVector.java:488)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:507)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:338)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$BinaryUpdater.readValues(ParquetVectorUpdaterFactory.java:704)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatchInternal(VectorizedRleValuesReader.java:230)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatch(VectorizedRleValuesReader.java:171)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:227)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:298)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage15.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage15.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Driver stacktrace:
```
To resolve this we set
`spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")`, and that
works. But we have to set this conf in multiple sections of our PySpark script
because Hudi internally changes the setting. How can we make sure that
`spark.sql.parquet.enableVectorizedReader=false` persists for the whole PySpark
job/script, so that Hudi does not overwrite it?
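For reference, here is a sketch of what we are trying to achieve: pinning the confs from the error message at submit time, so they are session defaults for the whole job rather than runtime overrides scattered through the script. `my_job.py` and the batch size of `512` are placeholders.

```python
# Spark confs taken from the error message above. Disabling the vectorized
# reader is the workaround that works for us; shrinking the batch size is the
# lighter-weight alternative the error message suggests (default is 4096).
confs = {
    "spark.sql.parquet.enableVectorizedReader": "false",
    "spark.sql.parquet.columnarReaderBatchSize": "512",  # example value
}

# Build the spark-submit invocation with one --conf flag per setting.
cmd = ["spark-submit"]
for key, value in confs.items():
    cmd += ["--conf", f"{key}={value}"]
cmd.append("my_job.py")  # placeholder entry script

print(" ".join(cmd))
```

Equivalently, the same keys could be passed through `SparkSession.builder.config(key, value)` before the session is created, instead of calling `spark.conf.set(...)` mid-script.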
Thank You