[jira] [Updated] (HUDI-8439) Table schema resolver issue

Y Ethan Guo (Jira) Fri, 25 Oct 2024 20:53:18 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-8439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Y Ethan Guo updated HUDI-8439:
------------------------------
    Fix Version/s: 1.1.0

> Table schema resolver issue
> ---------------------------
>
>                 Key: HUDI-8439
>                 URL: https://issues.apache.org/jira/browse/HUDI-8439
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Davis Zhang
>            Priority: Major
>             Fix For: 1.1.0
>
>
> Related Jira: https://issues.apache.org/jira/browse/HUDI-8219
>  
> Today the table schema resolver resolves the table schema as follows:
>  * Find the schema in the commit metadata of the latest completed instant 
> among commit, delta commit, compaction, and replacement instants in the 
> active timeline.
>  * If none exists, try to read the table creation schema from 
> hoodie.properties.
>  * If that does not exist either, read the parquet file schema.
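
The resolution order listed above can be sketched as a simple fallback chain. This is a toy model for illustration, not Hudi's implementation: `Instant`, `Table`, `SCHEMA_ACTIONS`, `schema_from_active_timeline` and `resolve_table_schema` are hypothetical names, schemas are plain strings, and the "latest completed instant" is simplified to the newest completed instant that carries a schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative action labels taken from the description above; not Hudi constants.
SCHEMA_ACTIONS = {"commit", "delta commit", "compaction", "replacement"}


@dataclass
class Instant:
    timestamp: str                 # sortable instant time
    action: str
    completed: bool
    schema: Optional[str] = None   # schema stored in the commit metadata, if any


@dataclass
class Table:
    active_timeline: List[Instant] = field(default_factory=list)
    creation_schema: Optional[str] = None   # stand-in for the value in hoodie.properties
    data_file_schema: Optional[str] = None  # stand-in for a schema read from a parquet file


def schema_from_active_timeline(table: Table) -> Optional[str]:
    """Step 1: newest completed, schema-bearing instant in the active timeline."""
    candidates = [
        i for i in table.active_timeline
        if i.completed and i.action in SCHEMA_ACTIONS and i.schema is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda i: i.timestamp).schema


def resolve_table_schema(table: Table) -> Optional[str]:
    """Try each source in the order described above; return the first hit."""
    for candidate in (
        schema_from_active_timeline(table),  # step 1: active timeline
        table.creation_schema,               # step 2: table creation schema
        table.data_file_schema,              # step 3: data file schema
    ):
        if candidate is not None:
            return candidate
    return None
```

Note that the archived timeline never appears in this chain, which is the gap the rest of the description is about.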
>  
> In fact, after checking the active timeline we should look into the archived 
> timeline for the table schema instead of falling back to the initial table 
> creation schema. With concurrent schema evolution, failing to read the 
> latest table schema allows conflicting schema evolutions to be committed.
>  
> Example:
>  * The initial table schema is s1.
>  * The latest table schema, as recorded by a completed instant in the 
> archived timeline, is s2, which has 1 new column compared to s1.
>  * Another writer uses schema s3, which also has 1 new nullable column 
> compared to s1 (so it is compatible with s1) but is not compatible with s2.
> Currently nothing stops this writer from making a commit, because it is not 
> aware of s2 at all.
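
The scenario above can be reproduced with a small toy model. Everything here is an assumption made for illustration: schemas are dicts of column name to type name, `is_compatible` is a deliberately simple rule (not Hudi's Avro compatibility check), and the column names are invented. The conflict is modelled as s2 and s3 adding the same column with different types.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name (toy stand-in for an Avro schema)


def is_compatible(table_schema: Schema, writer_schema: Schema) -> bool:
    """Toy rule: the writer must keep every table column with the same type."""
    return all(writer_schema.get(col) == typ for col, typ in table_schema.items())


def resolve(active: List[Schema], creation_schema: Schema) -> Schema:
    """Latest schema in the active timeline, else the table creation schema.

    The archived timeline is intentionally not consulted, mirroring the
    behaviour described above.
    """
    return active[-1] if active else creation_schema


s1 = {"id": "long", "name": "string"}   # initial table schema
s2 = {**s1, "extra": "int"}             # recorded only by an archived instant
s3 = {**s1, "extra": "string"}          # concurrent writer's schema

archived_timeline = [s2]
active_timeline: List[Schema] = []      # every schema-bearing instant was archived

resolved = resolve(active_timeline, creation_schema=s1)
print(resolved == s1)                # True: the resolver falls back to s1
print(is_compatible(resolved, s3))   # True: the check against s1 passes
print(is_compatible(s2, s3))         # False: s3 conflicts with the real latest schema
```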
>  
> Proposed fix:
> Instead of looking into the archived timeline, when the archiver moves a 
> batch of instants out of the active timeline, it can write a special empty 
> commit to the active timeline that records the latest table schema 
> indicated by the instants being moved out.
>  
> This guarantees that the active timeline always has information about the 
> table schema.
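
A minimal sketch of the proposed fix, under stated assumptions: `Instant`, `archive`, `latest_schema` and the `"schema marker"` action label are hypothetical, and schemas are plain strings. The ticket does not say which instant time the special commit should use. This sketch reuses the time of the newest archived schema-bearing instant so that the marker cannot shadow a newer schema still in the active timeline; that choice is an assumption of the sketch, not part of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instant:
    timestamp: str                 # sortable instant time
    action: str
    schema: Optional[str] = None   # schema stored in the commit metadata, if any


def archive(active: List[Instant], keep_latest: int) -> Tuple[List[Instant], List[Instant]]:
    """Move all but the newest `keep_latest` instants out of the active timeline.

    Returns (new_active, archived). If any archived instant carried a schema,
    an empty marker instant recording the newest such schema is left behind.
    """
    ordered = sorted(active, key=lambda i: i.timestamp)
    cut = max(len(ordered) - keep_latest, 0)
    archived, remaining = ordered[:cut], ordered[cut:]

    with_schema = [i for i in archived if i.schema is not None]
    if with_schema:
        latest = with_schema[-1]
        # Hypothetical special empty commit that only carries the schema.
        marker = Instant(latest.timestamp, "schema marker", latest.schema)
        remaining = [marker] + remaining
    return remaining, archived


def latest_schema(active: List[Instant]) -> Optional[str]:
    """Newest schema recorded anywhere in the active timeline, if any."""
    ordered = sorted(active, key=lambda i: i.timestamp)
    with_schema = [i for i in ordered if i.schema is not None]
    return with_schema[-1].schema if with_schema else None


active = [
    Instant("001", "commit", "s1"),
    Instant("002", "commit", "s2"),
    Instant("003", "clean"),       # carries no schema
]
new_active, archived = archive(active, keep_latest=1)
print(latest_schema(new_active))   # s2, even though both commits were archived
```

Because an earlier marker is itself a schema-bearing instant, a later archival run folds it into the next marker, so the active timeline keeps exactly the information the last two quoted lines ask for.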



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
