Tyler Jackson created HUDI-2323:
-----------------------------------

             Summary: Upsert of Case Class with single field causes SchemaParseException
                 Key: HUDI-2323
                 URL: https://issues.apache.org/jira/browse/HUDI-2323
             Project: Apache Hudi
          Issue Type: Bug
          Components: Spark Integration, Storage Management
    Affects Versions: 0.8.0
            Reporter: Tyler Jackson
         Attachments: HudiSchemaGenerationTest.scala

Additional background information:

Spark version 3.1.1
Scala version 2.12
Hudi version 0.8.0 (hudi-spark-bundle_2.12 artifact)

 

While testing a Spark job on EMR that inserts and then upserts data for a 
fairly complex nested case class structure, I ran into an issue that I had a 
hard time tracking down. It seems that when a nested case class in the 
DataFrame to be written has only a single field, Avro schema generation 
fails with the following stack trace, but only on the upsert:



{{21/08/19 15:08:34 ERROR BoundedInMemoryExecutor: error producing records}}
{{org.apache.avro.SchemaParseException: Can't redefine: array}}
{{ at org.apache.avro.Schema$Names.put(Schema.java:1128)}}
{{ at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)}}
{{ at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)}}
{{ at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)}}
{{ at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)}}
{{ at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)}}
{{ at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)}}
{{ at org.apache.avro.Schema.toString(Schema.java:324)}}
{{ at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)}}
{{ at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)}}
{{ at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:475)}}
{{ at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)}}
{{ at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)}}
{{ at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)}}
{{ at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)}}
{{ at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)}}
{{ at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)}}
{{ at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)}}
{{ at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)}}
{{ at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)}}
{{ at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)}}
{{ at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)}}
{{ at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)}}
{{ at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)}}
{{ at java.util.concurrent.FutureTask.run(FutureTask.java:266)}}
{{ at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)}}
{{ at java.util.concurrent.FutureTask.run(FutureTask.java:266)}}
{{ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
{{ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
{{ at java.lang.Thread.run(Thread.java:748)}}

 

I am able to replicate the problem in my local IntelliJ setup using the test 
that has been attached to this issue. The problem can be observed in the 
DummyStepParent case class. Simply adding an additional field to the case class 
eliminates the problem altogether (which is an acceptable workaround for our 
purposes, but shouldn't ultimately be necessary).

{{case class DummyObject (}}
{{     fieldOne: String,}}
{{     listTwo: Seq[String],}}
{{     listThree: Seq[DummyChild],}}
{{     listFour: Seq[DummyStepChild],}}
{{     fieldFive: Boolean,}}
{{     listSix: Seq[DummyParent],}}
{{     listSeven: Seq[DummyCousin],}}
{{     {color:#de350b}listEight: Seq[DummyStepParent]{color}}}
{{ )}}
{{case class DummyChild(childFieldOne: String, childFieldTwo: Int)}}
{{case class DummyStepChild(stepChildFieldOne: String, stepChildFieldTwo: Boolean)}}
{{case class DummyParent(children: Seq[DummyChild], stepChildren: Seq[DummyStepChild])}}
{{{color:#de350b}case class DummyStepParent(children: Seq[DummyChild]){color}}}
{{case class DummyCousin(cousinFieldOne: String, cousinFieldTwo: Seq[DummyChild])}}
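
For illustration, the workaround described above (adding a second field to 
the single-field case class) might look roughly like the sketch below. The 
names DummyStepParentPadded, placeholder, and WorkaroundSketch.pad are 
hypothetical and not part of the attached test; the report does not say which 
field was actually added, so treat this only as one possible shape.

```scala
// Nested element type, as defined in the report.
case class DummyChild(childFieldOne: String, childFieldTwo: Int)

// Shape from the report that fails on upsert: exactly one field.
case class DummyStepParent(children: Seq[DummyChild])

// Workaround sketch: the same data plus one extra field.
// `placeholder` is a hypothetical name chosen for this example.
case class DummyStepParentPadded(
  children: Seq[DummyChild],
  placeholder: Option[String] = None
)

object WorkaroundSketch {
  // Copy the single-field shape into the two-field shape before writing.
  def pad(p: DummyStepParent): DummyStepParentPadded =
    DummyStepParentPadded(p.children)
}
```

With this shape, listEight in DummyObject would be declared as 
Seq[DummyStepParentPadded] instead of Seq[DummyStepParent].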



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
