parisni commented on issue #9005:
URL: https://github.com/apache/hudi/issues/9005#issuecomment-1595861748

   After some investigation, the legacy merger also fails in this particular 
case (see the log below). In both cases, the cause is that the newly added 
column has `nullable=false`. Moreover, with 
`hoodie.avro.schema.validate=true`, the write fails early with the error: 
`Incoming batch schema is not compatible with the table's one`.
   
   So the NPE is likely not a real bug (although it is hard for an end user 
to understand): we try to write a null into the new column for existing 
records, but that column is declared not null.
   
   Indeed, when the new column has `nullable=true`, schema evolution works 
fine. For example, by inserting `df = spark.sql("select '2' as 
event_id, '2' as ts, '3' as version, 'foo' as event_date, case when 1=1 then 
'foo' else null end as add_col")`.
   
   What about providing the ability to automatically convert the incoming 
schema to `nullable=true` in Hudi to avoid such errors? @nsivabalan @yihua
   
   ```
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key event_id:1 from old file /tmp/test_hudi_merger/version=3/event_date=foo/17d0126f-b7a8-4b7f-95e7-1e65a8f36e8d-0_0-83-83_20230617230002764.parquet to new file /tmp/test_hudi_merger/version=3/event_date=foo/17d0126f-b7a8-4b7f-95e7-1e65a8f36e8d-0_0-91-91_20230617230003209.parquet with writerSchema {
     "type" : "record",
     "name" : "test_hudi_merger_record",
     "namespace" : "hoodie.test_hudi_merger",
     "fields" : [ {
       "name" : "_hoodie_commit_time",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_commit_seqno",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_record_key",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_partition_path",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_file_name",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "event_id",
       "type" : "string"
     }, {
       "name" : "ts",
       "type" : "string"
     }, {
       "name" : "version",
       "type" : "string"
     }, {
       "name" : "event_date",
       "type" : "string"
     }, {
       "name" : "add_col",
       "type" : "string"
     } ]
   }
           at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:370)
           at org.apache.hudi.table.action.commit.BaseMergeHelper$UpdateHandler.consume(BaseMergeHelper.java:54)
           at org.apache.hudi.table.action.commit.BaseMergeHelper$UpdateHandler.consume(BaseMergeHelper.java:44)
           at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67)
           ... 33 more
   Caused by: java.lang.RuntimeException: Null-value for required field: add_col
           at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200)
           at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:171)
           at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
           at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310)
           at org.apache.hudi.io.storage.HoodieBaseParquetWriter.write(HoodieBaseParquetWriter.java:80)
           at org.apache.hudi.io.storage.HoodieAvroParquetWriter.writeAvro(HoodieAvroParquetWriter.java:76)
           at org.apache.hudi.io.storage.HoodieAvroFileWriter.write(HoodieAvroFileWriter.java:51)
           at org.apache.hudi.io.storage.HoodieFileWriter.write(HoodieFileWriter.java:43)
           at org.apache.hudi.io.HoodieMergeHandle.writeToFile(HoodieMergeHandle.java:384)
           at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:365)
           ... 36 more
   ```
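
To illustrate the proposal of relaxing the incoming schema to nullable, here is a minimal pure-Python sketch that works on an Avro schema represented as a dict, abbreviated from the `writerSchema` in the log. The helpers `relax_to_nullable` and `null_violations` are hypothetical names made up for this illustration; they are not Hudi APIs, and this is not how Hudi implements merging. The sketch only shows why an old record without `add_col` is rejected against the required field and accepted once the field becomes a `["null", "string"]` union.

```python
import copy


def relax_to_nullable(schema):
    """Return a copy of an Avro record schema whose top-level fields all accept null.

    Hypothetical helper for illustration only; not part of Hudi.
    """
    relaxed = copy.deepcopy(schema)
    for field in relaxed["fields"]:
        ftype = field["type"]
        if isinstance(ftype, list):
            if "null" in ftype:
                continue  # already nullable, leave untouched
            field["type"] = ["null"] + ftype
        else:
            field["type"] = ["null", ftype]
        # Avro requires a union default to match the first branch, here "null".
        field["default"] = None
    return relaxed


def null_violations(record, schema):
    """List the fields that would be written as null but do not accept null."""
    bad = []
    for field in schema["fields"]:
        ftype = field["type"]
        accepts_null = ftype == "null" or (isinstance(ftype, list) and "null" in ftype)
        if record.get(field["name"]) is None and not accepts_null:
            bad.append(field["name"])
    return bad


# Abbreviated version of the writerSchema from the log above.
writer_schema = {
    "type": "record",
    "name": "test_hudi_merger_record",
    "namespace": "hoodie.test_hudi_merger",
    "fields": [
        {"name": "_hoodie_commit_time", "type": ["null", "string"], "doc": "", "default": None},
        {"name": "event_id", "type": "string"},
        {"name": "add_col", "type": "string"},
    ],
}

# A record written before add_col existed: it has no value for the new column.
old_record = {"_hoodie_commit_time": "20230617230002764", "event_id": "1"}

relaxed = relax_to_nullable(writer_schema)
print(null_violations(old_record, writer_schema))  # ['add_col']
print(null_violations(old_record, relaxed))        # []
```

With the original schema, the old record trips on `add_col`, which mirrors the `Null-value for required field: add_col` error in the log. After relaxation the same record passes, which matches the observation that the write succeeds when the new column has `nullable=true`.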

