WilliamWhispell commented on issue #2265:
URL: https://github.com/apache/hudi/issues/2265#issuecomment-830465596
Then, following on from the above: if we start pyspark with --conf 'spark.hadoop.parquet.avro.write-old-list-structure=false' (launch sketched below) and write the Hudi table with the same steps as above, the write fails.
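For reference, a minimal sketch of the shell launch, assuming a stock local pyspark install with nothing else changed:

$ pyspark --conf 'spark.hadoop.parquet.avro.write-old-list-structure=false'

The write and the errors it produces: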
>>> spark_df.write.format("org.apache.hudi") \
...     .mode("append") \
...     .option("hoodie.table.name", "apple") \
...     .option("hoodie.datasource.write.precombine.field", "hudi_key") \
...     .option("hoodie.datasource.write.recordkey.field", "hudi_key") \
...     .option("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
...     .option("parquet.avro.write-old-list-structure", "false") \
...     .option("hoodie.parquet.avro.write-old-list-structure", "false") \
...     .save("/home/jovyan/apple.parquet")
21/04/30 23:54:57 ERROR HoodieCreateHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=5 partitionPath=default}, currentLocation='null', newLocation='null'}
java.lang.ClassCastException: repeated binary array (UTF8) is not a group
    at org.apache.parquet.schema.Type.asGroupType(Type.java:207)
    at org.apache.parquet.avro.AvroWriteSupport$ThreeLevelListWriter.writeCollection(AvroWriteSupport.java:610)
    at org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:395)
    at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
    at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
    at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
    at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:83)
    at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:118)
    at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:163)
    at org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:96)
    at org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40)
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
    at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
21/04/30 23:54:57 ERROR HoodieCreateHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=1 partitionPath=default}, currentLocation='null', newLocation='null'}
java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
    at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.startGroup(MessageColumnIO.java:395)
    at org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:393)
    at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
    ....
So it looks like, even without nulls in an array, we cannot get the 3-level list format? See the backward-compatibility rules:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
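One way to confirm which list encoding actually landed on disk is to dump the physical Parquet schema of a file Hudi wrote. A minimal sketch with pyarrow (the part-file name is a placeholder; substitute any .parquet file under the table path):

>>> import pyarrow.parquet as pq
>>> # hypothetical path: pick any part file Hudi wrote under the table path
>>> pf = pq.ParquetFile("/home/jovyan/apple.parquet/<part-file>.parquet")
>>> print(pf.schema)  # prints the physical schema, including list repetition structure

With the legacy 2-level layout the list column shows up as a single repeated field (e.g. "repeated binary array (UTF8)"), while the 3-level layout nests a repeated group holding an optional element, which is exactly the mismatch the ClassCastException above is complaining about.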