jeguiguren-cohere commented on issue #6849:
URL: https://github.com/apache/hudi/issues/6849#issuecomment-1300984254
I encountered the same issue, and was able to consistently reproduce locally
by following what was described here;
I basically write the following document twice, and the second time it fails
with `Caused by: org.apache.avro.SchemaParseException: Can't redefine: array`
```
{
"clusterTime": 0,
"documentKey": {
"_id": "abc"
},
"arrayName": [
{
"foo": "bar"
}
],
"nestedPath": {
"anotherNestedPath": [
{
"arrayName": [
{
"foo": "bar"
}
]
}
]
}
}
```
Hudi configurations:
```
{
"path":
"file:////tmp/pytest-of-glue_user/pytest-0/test_schema_array_single_type0/hudi_output",
"connectionName": "spark3-1-hudi0-12-1-glue-connector",
"hoodie.table.name": "userTest",
"hoodie.datasource.write.recordkey.field": "documentKey._id",
"hoodie.datasource.write.precombine.field": "clusterTime",
"hoodie.datasource.write.reconcile.schema": "true",
"hoodie.schema.on.read.enable": "true",
"hoodie.index.type": "GLOBAL_SIMPLE",
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.NonpartitionedKeyGenerator"
}
```
Versions used:
```
Glue 3.0
Spark 3.1
Hudi 0.12.1
'spark.jars.packages','org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.1'
```
Full stack trace:
```
22/11/01 21:37:24 ERROR BaseSparkCommitActionExecutor: Error upserting
bucketType UPDATE for partition :0
org.apache.hudi.exception.HoodieException:
org.apache.avro.SchemaParseException: Can't redefine: array
at
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
at
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: array
at org.apache.avro.Schema$Names.put(Schema.java:1128)
at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
at org.apache.avro.Schema.toString(Schema.java:324)
at org.apache.avro.Schema.toString(Schema.java:314)
at
org.apache.parquet.avro.AvroReadSupport.setAvroReadSchema(AvroReadSupport.java:69)
at
org.apache.hudi.io.storage.HoodieParquetReader.getRecordIterator(HoodieParquetReader.java:69)
at
org.apache.hudi.io.storage.HoodieFileReader.getRecordIterator(HoodieFileReader.java:43)
at
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:132)
... 31 more
```
I noticed that this does NOT happen when running Hudi `0.11.1`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]