[GitHub] [hudi] kazdy commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

GitBox Sat, 17 Sep 2022 04:10:32 -0700


kazdy commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r973573340



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   I'll add my 5 cents as a Hudi user :)
   AFAIK schema reconciliation was meant to apply the latest table schema to 
the incoming batch. But if a new batch contained new columns and was missing 
some existing columns at the same time, the new columns were added and the 
missing ones were dropped (at least on read; physically they still exist in 
the files).
   
   I feel like Hudi could have a mergeSchema option for both df writes and 
SQL MERGE INTO (currently the target table schema is applied), as in Delta and 
now Iceberg, or in the Parquet datasource in Spark. It would behave the same 
as having reconciliation and schema evolution both enabled now. Then reconcile 
schema could behave differently.
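
   To make the "merge" idea concrete, here is a minimal sketch in plain Java of what I mean by a union of the table schema and the incoming batch schema. This is not Hudi, Delta or Spark code: `SchemaMergeSketch` and `mergeSchemas` are hypothetical names, and a schema is modelled as just an ordered map of column name to type name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaMergeSketch {

    // Union of two schemas (column name -> type name). Table columns keep their
    // order; columns that exist only in the incoming batch are appended.
    // Hypothetical helper for illustration, not a Hudi API.
    static Map<String, String> mergeSchemas(Map<String, String> table,
                                            Map<String, String> incoming) {
        Map<String, String> merged = new LinkedHashMap<>(table);
        for (Map.Entry<String, String> col : incoming.entrySet()) {
            // putIfAbsent returns the existing type if the column is already there
            String existing = merged.putIfAbsent(col.getKey(), col.getValue());
            if (existing != null && !existing.equals(col.getValue())) {
                throw new IllegalArgumentException(
                    "Conflicting types for column " + col.getKey());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("id", "long");
        table.put("name", "string");
        table.put("price", "double");

        // The incoming batch adds "currency" and is missing "price".
        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("id", "long");
        incoming.put("name", "string");
        incoming.put("currency", "string");

        // Neither the new nor the missing column is lost.
        System.out.println(mergeSchemas(table, incoming));
        // prints {id=long, name=string, price=double, currency=string}
    }
}
```

   The point of the sketch is only that a batch which both adds and omits columns ends up with the union, so `price` is kept rather than dropped. Type widening and nested fields are deliberately left out.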
   
   When we ingest data, the team producing it does not always inform me about 
changes, and it would be nice to have a mechanism that can handle this. 
Currently most Hudi users I know just create an uber schema and apply it to 
the df before write; sometimes that's hard because of how the org we work for 
functions.
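
   For illustration, the uber schema workaround looks roughly like the sketch below, again in plain Java rather than Spark. `UberSchemaSketch` and `conform` are hypothetical names, and a record is modelled as a map of column name to value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UberSchemaSketch {

    // Project a record onto a fixed, ordered list of columns (the "uber schema").
    // Columns the record lacks become null. Columns the uber schema does not know
    // are rejected, which is the case where someone has to update it by hand.
    // Hypothetical helper for illustration, not a Hudi or Spark API.
    static Map<String, Object> conform(List<String> uberColumns,
                                       Map<String, Object> record) {
        for (String key : record.keySet()) {
            if (!uberColumns.contains(key)) {
                throw new IllegalArgumentException(
                    "Column not in uber schema: " + key);
            }
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : uberColumns) {
            out.put(column, record.get(column)); // null when the column is missing
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> uber = Arrays.asList("id", "name", "price");

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 1L);
        record.put("name", "a");

        System.out.println(conform(uber, record));
        // prints {id=1, name=a, price=null}
    }
}
```

   Missing columns are easy to handle this way. The weak spot is the unknown-column branch: when a producer adds a column without telling anyone, the uber schema has to be changed manually before the write can go through.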
   
   for some context:
   #5899 - for mergeSchema in MERGE INTO statement
   #5873 and #5452 - issues with reconcile schema



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

