kazdy commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r973573340
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
.key("hoodie.datasource.write.reconcile.schema")
- .defaultValue(false)
+ .defaultValue(true)
Review Comment:
I'll add my 5 cents as a Hudi user :)
AFAIK schema reconciliation was meant to apply the latest table schema to the
incoming batch. But if a new batch contained new columns and was missing some
existing columns at the same time, the new columns were added and the missing
ones were dropped (at least on read; they still physically exist in the
files).
I feel like Hudi could have a mergeSchema option for both DataFrame writes and
SQL MERGE INTO (currently the target table schema is applied), as in Delta and
now Iceberg, or in the Parquet datasource in Spark. It would behave the same as
having reconciliation and schema evolution both enabled today. Then reconcile
schema could behave differently.
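
To make the difference concrete, here is a toy sketch in plain Java. It is not Hudi code and uses no Hudi APIs: schemas are modelled as ordered maps from column name to type name, and both method names are made up for illustration. `incomingSchemaWins` mimics the behaviour I described above, and `mergeSchemas` mimics what I would expect from a mergeSchema-style option. Type conflicts between columns with the same name are ignored here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SchemaMergeSketch {

    // Toy model of the behaviour described above (hypothetical helper, not a Hudi API):
    // the incoming batch's schema becomes the reader-visible schema, so table
    // columns that the batch does not carry disappear.
    static Map<String, String> incomingSchemaWins(Map<String, String> table,
                                                  Map<String, String> incoming) {
        return new LinkedHashMap<>(incoming);
    }

    // Toy model of mergeSchema-style semantics (hypothetical helper, not a Hudi API):
    // keep every existing table column, then append columns only the batch has.
    // Assumption: a column present in both keeps the table's type.
    static Map<String, String> mergeSchemas(Map<String, String> table,
                                            Map<String, String> incoming) {
        Map<String, String> merged = new LinkedHashMap<>(table);
        for (Map.Entry<String, String> col : incoming.entrySet()) {
            merged.putIfAbsent(col.getKey(), col.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("id", "long");
        table.put("name", "string");
        table.put("price", "double");

        // The batch adds "discount" and is missing "price".
        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("id", "long");
        incoming.put("name", "string");
        incoming.put("discount", "double");

        System.out.println(incomingSchemaWins(table, incoming).keySet()); // [id, name, discount]
        System.out.println(mergeSchemas(table, incoming).keySet());       // [id, name, price, discount]
    }
}
```

With the merge variant, `price` survives and would simply be null for rows from the new batch.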
When we ingest data, the team producing it does not always inform me about
changes, and it would be nice to have a mechanism that can handle this.
Currently most Hudi users I know just create an uber schema and apply it to the
df before write; sometimes that's hard to do because of how the org we work for
functions.
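
For reference, the uber schema workaround boils down to something like the sketch below. It is plain Java with rows modelled as maps, not Spark or Hudi code, and `conform` is a hypothetical helper. Every row is padded with nulls up to the full column list, and columns nobody told us about make it fail loudly. Failing on unknown columns is my assumption here; silently dropping them is the other obvious choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UberSchemaSketch {

    // Hypothetical helper: reshape a row so it has exactly the uber schema's
    // columns, in the uber schema's order, with null for anything missing.
    static Map<String, Object> conform(List<String> uberColumns, Map<String, Object> row) {
        // Columns the producer added without the uber schema being updated.
        List<String> unknown = new ArrayList<>(row.keySet());
        unknown.removeAll(uberColumns);
        if (!unknown.isEmpty()) {
            throw new IllegalArgumentException("Columns missing from uber schema: " + unknown);
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : uberColumns) {
            out.put(column, row.get(column)); // get() returns null when the column is absent
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> uberColumns = Arrays.asList("id", "name", "price");

        Map<String, Object> row = new HashMap<>();
        row.put("id", 1L);
        row.put("price", 9.5);

        System.out.println(conform(uberColumns, row)); // {id=1, name=null, price=9.5}
    }
}
```

The painful part is that the column list has to be updated by hand every time a producer changes something, which is exactly what a merge option would avoid.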
For some context:
#5899 - for mergeSchema in the MERGE INTO statement
#5873 and #5452 - issues with reconcile schema