sbernauer commented on issue #1806: URL: https://github.com/apache/hudi/issues/1806#issuecomment-655461841
I tracked the failing validation down to this line: https://github.com/apache/avro/blob/2c7b9af7d5ba35afe9cf84eae3b273a6df0612b1/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L615

It should check not only for `null` but also for `JsonProperties.NULL_VALUE` (the same bugfix that is needed in https://github.com/apache/avro/blob/2c7b9af7d5ba35afe9cf84eae3b273a6df0612b1/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L886). With that change the `validate()` method no longer complains. Deltastreamer then accepts the record and tries to write it into a parquet file, which is where it fails:

```
java.lang.ClassCastException: org.apache.avro.JsonProperties$Null cannot be cast to java.lang.Number
	at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:327)
	at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
	at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
	at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
	at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
	at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
	at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:92)
	at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:104)
	at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:152)
	at org.apache.hudi.execution.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:145)
	at org.apache.hudi.execution.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:128)
	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Looking into the code at https://github.com/apache/parquet-mr/blob/2589cc821d2d470be1e79b86f511eb1f5fee4e5c/parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java#L276 we can see that parquet-mr resolves the `UNION{NULL, DOUBLE}` to `DOUBLE`, and therefore the `ClassCastException` occurs. The doc of the `writeValue()` method also states "Value MUST not be null", which seems strange to me, because we can write `null` values (as you can see later on).

So I tried another approach: why pass `JsonProperties.NULL_VALUE` to Avro at all when parquet-mr cannot handle it? The new approach is to pass `null` instead of `JsonProperties.NULL_VALUE` to Avro whenever the default value is `JsonProperties.NULL_VALUE`. This also removes the need for a patched Avro 1.8.2 that partially supports `JsonProperties.NULL_VALUE`. And indeed, this fixed the problem.

I was ready to open a PR when I noticed that this has already been fixed on the master branch by [HUDI-803](https://issues.apache.org/jira/browse/HUDI-803) (commit https://github.com/apache/hudi/commit/6a0aa9a645d11ed7b50e18aa0563dafcd9d145f7).

TL;DR: It's already fixed by [HUDI-803](https://issues.apache.org/jira/browse/HUDI-803) on the master branch and will be in the next release.

Cheers,
Sebastian
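For context, the schema shape that triggers this path is a nullable field whose default is JSON `null` (which Avro represents internally as `JsonProperties.NULL_VALUE`). A minimal example might look like the following; the record and field names are made up for illustration:

```json
{
  "type": "record",
  "name": "ExampleRecord",
  "fields": [
    { "name": "price", "type": ["null", "double"], "default": null }
  ]
}
```

When a record omits `price`, the default is filled in as the `JsonProperties.NULL_VALUE` sentinel rather than a real `null`, which is what parquet-mr later tries (and fails) to cast to `java.lang.Number`.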
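To make the "pass `null` instead of `JsonProperties.NULL_VALUE`" idea concrete, here is a self-contained sketch. Note this is not the actual Avro or Hudi code: `NULL_SENTINEL`, `normalizeDefault`, and `isNullDatum` are hypothetical stand-ins for `JsonProperties.NULL_VALUE` and the checks in `GenericData`, just to illustrate the two fixes discussed above (normalizing the sentinel before handing it to the writer, and accepting the sentinel in validation):

```java
public class NullDefaultSketch {

    // Stand-in for org.apache.avro.JsonProperties.NULL_VALUE: a singleton
    // sentinel object that represents a JSON null default value.
    public static final Object NULL_SENTINEL = new Object() {
        @Override
        public String toString() {
            return "null";
        }
    };

    // The workaround described above: before a default value reaches the
    // parquet writer, map the sentinel to a real null so the writer never
    // sees an object it cannot cast.
    public static Object normalizeDefault(Object defaultValue) {
        return (defaultValue == NULL_SENTINEL) ? null : defaultValue;
    }

    // The validate()-style fix: treat a datum as a valid "null" if it is
    // either a real null or the null sentinel.
    public static boolean isNullDatum(Object datum) {
        return datum == null || datum == NULL_SENTINEL;
    }
}
```

With this normalization in place, a datum that was filled in from a `"default": null` schema entry reaches the writer as a plain `null`, which parquet-mr handles by simply skipping the field instead of attempting a numeric cast.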
