[
https://issues.apache.org/jira/browse/HUDI-8439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Y Ethan Guo updated HUDI-8439:
------------------------------
Fix Version/s: 1.1.0
> Table schema resolver issue
> ---------------------------
>
> Key: HUDI-8439
> URL: https://issues.apache.org/jira/browse/HUDI-8439
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Davis Zhang
> Priority: Major
> Fix For: 1.1.0
>
>
> Related Jira: https://issues.apache.org/jira/browse/HUDI-8219
>
> Today the table schema resolver resolves the table schema as follows:
> * Look up the schema in the commit metadata of the latest completed instant
> (commit, delta commit, compaction, or replacement) in the active timeline.
> * If none exists, try to read the table creation schema from
> hoodie.properties.
> * If that does not exist either, read the schema from a parquet file.
>
> In fact, when the active timeline yields no schema, we should look into the
> archived timeline for the table schema instead of falling back to the
> initial table creation schema. With concurrent schema evolution, failing to
> read the latest table schema allows conflicting schema evolutions to be
> committed.
>
> Example:
> * The initial table schema is s1.
> * The latest table schema, as indicated by a completed instant in the
> archived timeline, is s2, which has 1 new column compared to s1.
> * Another writer uses schema s3, which also has 1 new nullable column
> compared to s1 (compatible with s1) but is not compatible with s2.
> Currently nothing stops this writer from making a commit, as it is not aware
> of s2 at all.
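
The scenario above can be reproduced with a toy model. This is only a sketch
under loud assumptions: schemas are plain sets of column names, the
compatibility rule ("the writer schema must keep every column of the table
schema") is a simplification and not Hudi's Avro-based check, and the class
and method names (StaleSchemaSketch, resolve, isCompatible) are hypothetical
rather than Hudi APIs. It models only the first two resolution steps.

```java
import java.util.List;
import java.util.Set;

final class StaleSchemaSketch {
    // Toy compatibility rule (an assumption, not Hudi's real check):
    // the writer schema must keep every column of the table schema.
    static boolean isCompatible(Set<String> tableSchema, Set<String> writerSchema) {
        return writerSchema.containsAll(tableSchema);
    }

    // Mirrors the first two resolution steps: use the schema of the latest
    // instant in the active timeline, else fall back to the creation schema.
    // The archived timeline is never consulted.
    static Set<String> resolve(List<Set<String>> activeTimelineSchemas,
                               Set<String> creationSchema) {
        if (activeTimelineSchemas.isEmpty()) {
            return creationSchema;
        }
        return activeTimelineSchemas.get(activeTimelineSchemas.size() - 1);
    }

    public static void main(String[] args) {
        Set<String> s1 = Set.of("id", "name");           // creation schema
        Set<String> s2 = Set.of("id", "name", "email");  // archived, latest
        Set<String> s3 = Set.of("id", "name", "phone");  // concurrent writer

        // The instant carrying s2 was archived, so the active timeline
        // contributes no schema and the resolver falls back to s1.
        Set<String> resolved = resolve(List.of(), s1);

        System.out.println(resolved.equals(s1));          // prints true
        System.out.println(isCompatible(resolved, s3));   // prints true
        System.out.println(isCompatible(s2, s3));         // prints false
    }
}
```

The writer using s3 passes validation only because the resolver returned the
stale s1; validated against s2 it would have been rejected.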
>
> Proposed fix:
> Instead of looking into the archived timeline, when the archiver moves a
> batch of instants out of the active timeline, it can write a special empty
> commit to the active timeline that records the latest table schema indicated
> by the instants being moved out.
>
> This guarantees that the active timeline always has information about the
> table schema.
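
A minimal sketch of the proposed fix, again as a toy model rather than Hudi
code. TimelineEntry, archive, and latestSchema are hypothetical names, and
schemas are plain strings. One detail is an assumption of this sketch and is
not stated in the issue: the marker commit is written only when none of the
remaining active instants carries a schema, so that a newer schema in the
active timeline is never shadowed by an older one from the archive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class ArchiverSchemaMarkerSketch {
    // Hypothetical stand-in for a completed instant on the timeline.
    static final class TimelineEntry {
        final String time;
        final String schema; // null when the instant carries no schema

        TimelineEntry(String time, String schema) {
            this.time = time;
            this.schema = schema;
        }
    }

    // Schema of the newest entry that carries one, scanning from the end.
    static Optional<String> latestSchema(List<TimelineEntry> timeline) {
        for (int i = timeline.size() - 1; i >= 0; i--) {
            if (timeline.get(i).schema != null) {
                return Optional.of(timeline.get(i).schema);
            }
        }
        return Optional.empty();
    }

    // Moves the oldest `count` entries to the archive. If that would leave
    // the active timeline without any schema, appends an empty marker commit
    // carrying the latest schema found among the moved entries.
    static void archive(List<TimelineEntry> active, List<TimelineEntry> archived,
                        int count, String markerTime) {
        List<TimelineEntry> moved = new ArrayList<>(active.subList(0, count));
        active.subList(0, count).clear();
        archived.addAll(moved);

        Optional<String> movedSchema = latestSchema(moved);
        if (movedSchema.isPresent() && !latestSchema(active).isPresent()) {
            active.add(new TimelineEntry(markerTime, movedSchema.get()));
        }
    }

    public static void main(String[] args) {
        List<TimelineEntry> active = new ArrayList<>();
        active.add(new TimelineEntry("001", "s1"));
        active.add(new TimelineEntry("002", "s2"));
        active.add(new TimelineEntry("003", null)); // e.g. an instant without schema
        List<TimelineEntry> archived = new ArrayList<>();

        archive(active, archived, 2, "004");

        // The active timeline still resolves to s2 via the marker commit.
        System.out.println(latestSchema(active).orElse("none")); // prints s2
    }
}
```

With the marker in place, the resolver's first step succeeds without reading
the archived timeline, and a writer using s3 from the example would be
checked against s2 instead of s1.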