David Palmer created PARQUET-2339:
-------------------------------------
Summary: ArrayIndexOutOfBoundsException writing Parquet from Avro in Apache Hudi
Key: PARQUET-2339
URL: https://issues.apache.org/jira/browse/PARQUET-2339
Project: Parquet
Issue Type: Bug
Components: parquet-avro, parquet-mr
Affects Versions: 1.12.3
Environment: Amazon EMR 6.12.x, Apache Hudi 0.13.1, Apache Spark 3.4.0, Linux in Docker
Reporter: David Palmer
While writing an Apache Hudi table using the DeltaStreamer utility, I receive
an exception from the Parquet `AvroWriteSupport` class:
```
23/08/17 22:43:50 ERROR HoodieCreateHandle: Error writing record HoodieRecord{key=HoodieKey {recordKey=id:05a3065f8cf0494f9dc449307a0fddd8,idx:01 partitionPath=event.year=2023/event.month=08/event.day=17/event.hour=22}, currentLocation='null', newLocation='null'}
java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5
    at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:476) ~[parquet-column-1.12.3-amzn-0.jar:1.12.3-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:358) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:287) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:174) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138) ~[parquet-hadoop-1.12.3-amzn-0.jar:1.12.3-amzn-0]
    at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310) ~[parquet-hadoop-1.12.3-amzn-0.jar:1.12.3-amzn-0]
    at org.apache.hudi.io.storage.HoodieBaseParquetWriter.write(HoodieBaseParquetWriter.java:80) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.storage.HoodieAvroParquetWriter.writeAvroWithMetadata(HoodieAvroParquetWriter.java:67) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.storage.HoodieAvroFileWriter.writeWithMetadata(HoodieAvroFileWriter.java:45) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.storage.HoodieFileWriter.writeWithMetadata(HoodieFileWriter.java:39) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.HoodieCreateHandle.doWrite(HoodieCreateHandle.java:147) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:175) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:98) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:42) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46) ~[scala-library-2.12.15.jar:?]
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[scala-library-2.12.15.jar:?]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
```
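For reference, the same `AvroWriteSupport` write path can be exercised outside of Hudi with a plain `AvroParquetWriter`; the sketch below uses a hypothetical schema and placeholder values (not my actual table), only to show which API the stack trace goes through.

```java
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroWriteSupportSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical record schema with an array field, standing in for the real table schema.
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("id")
        .name("tags").type().array().items().stringType().noDefault()
        .endRecord();

    GenericRecord record = new GenericData.Record(schema);
    record.put("id", "placeholder-id");
    record.put("tags", Arrays.asList("a", "b"));

    // AvroParquetWriter plugs AvroWriteSupport into the same
    // InternalParquetRecordWriter -> MessageColumnIO path seen in the stack trace above.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(
            new Path("/tmp/avro-write-support-sketch.parquet"))
        .withSchema(schema)
        .withConf(new Configuration())
        .build()) {
      writer.write(record);
    }
  }
}
```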
I have tried setting `spark.hadoop.parquet.avro.write-old-list-structure` to `false`, but the issue persists.
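As far as I understand, the `spark.hadoop.` prefix simply forwards the key into the Hadoop `Configuration` that the Parquet writer reads, i.e. the equivalent of the following when driving parquet-avro directly (the helper class and method name are mine, not an existing API):

```java
import org.apache.hadoop.conf.Configuration;

public final class ListStructureFlag {
  private ListStructureFlag() {}

  // Mirrors spark.hadoop.parquet.avro.write-old-list-structure=false on a plain Hadoop Configuration.
  // In parquet-avro 1.12.x the key defaults to true, i.e. the legacy two-level list layout.
  public static Configuration withThreeLevelLists(Configuration conf) {
    conf.setBoolean("parquet.avro.write-old-list-structure", false);
    return conf;
  }
}
```

Passing the resulting `Configuration` via `withConf(...)` in the sketch above is how the flag would reach the writer in the standalone case.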