[ 
https://issues.apache.org/jira/browse/HUDI-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4872:
----------------------------------
    Component/s: writer-core

> Support automatic schema evolution for SQL MERGE INTO with UPDATE * / INSERT *
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-4872
>                 URL: https://issues.apache.org/jira/browse/HUDI-4872
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: kazdy
>            Assignee: Alexey Kudinkin
>            Priority: Major
>             Fix For: 0.13.0
>
>
> I tried using MERGE INTO with UPDATE * and INSERT * clauses with full
> schema evolution enabled.
> I noticed that during the insert, new columns from the incoming batch (those
> that do not exist in the target table yet) are dropped and the target schema
> is applied. There are no warnings and no failed writes.
> Can we, as users, have the schema evolve automatically on MERGE INTO
> operations?
> I guess this should only be supported when the merge operation uses
> UPDATE SET * and INSERT *.
> *Expected behavior*
> When incoming data is missing columns that are already declared in the
> target table, these should be injected with default/null values.
> When incoming data has new columns that are not yet declared in the target
> table, these should be added to the target table.
> When incoming data has both missing columns and new columns, the missing
> columns should be injected with null/default values and the new columns
> should be added to the target table.
> New columns should be reflected in the metastore table schema.
> Complex types and nested schemas should be supported.
> Currently, something similar is supported for dataframe writes if both the
> schema reconciliation and schema evolution configs are enabled; see HUDI-4276.
> From a user experience perspective, it would be easier to have a
> _mergeSchema_ config (as for the Parquet Spark datasource) that enables this
> feature for both Spark SQL and dataframe writes.
> Thread from dev mailing list as a reference:
> [https://lists.apache.org/thread/kr59hh7yqr2c1y33kzfv3n97h6ydbz9b]
> GH issue: [https://github.com/apache/hudi/issues/5899]
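
The expected behavior listed in the description can be modelled in a few lines of plain Python. This is only a sketch of the requested semantics, not Hudi code: the helper names (evolve_schema, conform_row, merge_into), the dict-based schema representation, and the choice to fully replace a matched row (so a column missing from the source becomes None) are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical representation: column name -> type name, or a nested dict for a struct.
Schema = Dict[str, Any]
Row = Dict[str, Any]


def evolve_schema(target: Schema, incoming: Schema) -> Schema:
    """Extend the target schema with columns that only the incoming schema has.

    Target columns keep their position and type; new columns are appended.
    Struct columns present on both sides are merged recursively.
    """
    merged: Schema = {}
    for name, target_type in target.items():
        incoming_type = incoming.get(name)
        if isinstance(target_type, dict) and isinstance(incoming_type, dict):
            merged[name] = evolve_schema(target_type, incoming_type)
        else:
            merged[name] = target_type
    for name, incoming_type in incoming.items():
        if name not in merged:
            merged[name] = incoming_type
    return merged


def conform_row(row: Row, schema: Schema) -> Row:
    """Project a row onto the schema, filling missing columns with None."""
    out: Row = {}
    for name, field_type in schema.items():
        value = row.get(name)
        if isinstance(field_type, dict) and isinstance(value, dict):
            out[name] = conform_row(value, field_type)
        else:
            out[name] = value
    return out


def merge_into(
    target_schema: Schema,
    target_rows: List[Row],
    incoming_schema: Schema,
    incoming_rows: List[Row],
    key: str,
) -> Tuple[Schema, List[Row]]:
    """Model MERGE INTO ... UPDATE SET * / INSERT * with schema evolution."""
    schema = evolve_schema(target_schema, incoming_schema)
    # Existing rows are rewritten to the evolved schema (new columns become None).
    by_key = {row[key]: conform_row(row, schema) for row in target_rows}
    for row in incoming_rows:
        # Matched keys are replaced (UPDATE SET *), unmatched keys are added (INSERT *).
        by_key[row[key]] = conform_row(row, schema)
    return schema, list(by_key.values())
```

For example, merging a batch with columns (id, name, discount) into a table with columns (id, name, price) would yield the schema (id, name, price, discount) under this model: existing rows get a null discount, and incoming rows get a null price.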



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
