parisni commented on issue #9005:
URL: https://github.com/apache/hudi/issues/9005#issuecomment-1595861748
After some investigation, the legacy merger also fails in this particular
case (see the log below). In both cases, the reason is that the newly added
column has `nullable=false`. Moreover, when `hoodie.avro.schema.validate=true`
is set, the write fails early with the error: `Incoming batch schema is not
compatible with the table's one`.
The NPE is therefore likely not a real issue (although it is hard to understand
for the end user): we try to write a null into the new column for previous
records, but the column is declared not null.
Indeed, when the new column has `nullable=true`, schema evolution works
fine. For example, inserting `df = spark.sql("select '2' as event_id, '2' as
ts, '3' as version, 'foo' as event_date, case when 1=1 then 'foo' else null
end as add_col")` succeeds.
What about providing the ability to automatically turn the incoming schema
to `nullable=true` in Hudi to avoid such errors? @nsivabalan @yihua
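To illustrate, the kind of rewrite being proposed could look like the sketch below, in plain Python over the Avro JSON schema: every field whose type is not already a union with `"null"` gets wrapped in `["null", <type>]` with a null default, mirroring how Hudi already encodes its `_hoodie_*` metadata columns in the writerSchema above. `make_fields_nullable` is a hypothetical helper for illustration, not a Hudi API.

```python
import json

def make_fields_nullable(avro_schema: dict) -> dict:
    """Rewrite an Avro record schema so every field accepts null.

    Hypothetical helper: fields whose type is not already a union
    containing "null" are wrapped in ["null", <type>] and given a
    null default, so old records missing the new column can still
    be merged.
    """
    fields = []
    for field in avro_schema["fields"]:
        f = dict(field)
        t = f["type"]
        if isinstance(t, list):
            # Already a union: prepend "null" only if it is missing.
            if "null" not in t:
                f["type"] = ["null"] + t
                f.setdefault("default", None)
        else:
            # Plain type: wrap it in a nullable union.
            f["type"] = ["null", t]
            f.setdefault("default", None)
        fields.append(f)
    out = dict(avro_schema)
    out["fields"] = fields
    return out

# Example with the failing field from the log above.
schema = {
    "type": "record",
    "name": "test_hudi_merger_record",
    "fields": [{"name": "add_col", "type": "string"}],
}
print(json.dumps(make_fields_nullable(schema)["fields"][0]))
```

With `add_col` rewritten as `["null", "string"]`, the Parquet writer would no longer hit `Null-value for required field: add_col` for pre-existing records.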
```
Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge
old record into new file for key event_id:1 from old file
/tmp/test_hudi_merger/version=3/event_date=foo/17d0126f-b7a8-4b7f-95e7-1e65a8f36e8d-0_0-83-83_20230617230002764.parquet
to new file
/tmp/test_hudi_merger/version=3/event_date=foo/17d0126f-b7a8-4b7f-95e7-1e65a8f36e8d-0_0-91-91_20230617230003209.parquet
with writerSchema {
"type" : "record",
"name" : "test_hudi_merger_record",
"namespace" : "hoodie.test_hudi_merger",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "event_id",
"type" : "string"
}, {
"name" : "ts",
"type" : "string"
}, {
"name" : "version",
"type" : "string"
}, {
"name" : "event_date",
"type" : "string"
}, {
"name" : "add_col",
"type" : "string"
} ]
}
at
org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:370)
at
org.apache.hudi.table.action.commit.BaseMergeHelper$UpdateHandler.consume(BaseMergeHelper.java:54)
at
org.apache.hudi.table.action.commit.BaseMergeHelper$UpdateHandler.consume(BaseMergeHelper.java:44)
at
org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67)
... 33 more
Caused by: java.lang.RuntimeException: Null-value for required field: add_col
at
org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200)
at
org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:171)
at
org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
at
org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310)
at
org.apache.hudi.io.storage.HoodieBaseParquetWriter.write(HoodieBaseParquetWriter.java:80)
at
org.apache.hudi.io.storage.HoodieAvroParquetWriter.writeAvro(HoodieAvroParquetWriter.java:76)
at
org.apache.hudi.io.storage.HoodieAvroFileWriter.write(HoodieAvroFileWriter.java:51)
at
org.apache.hudi.io.storage.HoodieFileWriter.write(HoodieFileWriter.java:43)
at
org.apache.hudi.io.HoodieMergeHandle.writeToFile(HoodieMergeHandle.java:384)
at
org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:365)
... 36 more
```