Davis Zhang created HUDI-8439:
---------------------------------
Summary: Table schema resolver issue
Key: HUDI-8439
URL: https://issues.apache.org/jira/browse/HUDI-8439
Project: Apache Hudi
Issue Type: Improvement
Reporter: Davis Zhang
Related Jira: https://issues.apache.org/jira/browse/HUDI-8219
Today the table schema resolver resolves the table schema as follows:
* Look up the schema in the commit metadata of the latest completed instant
(commit, delta commit, compaction, or replace commit) in the active timeline.
* If none is found, read the table creation schema from hoodie.properties.
* If that does not exist either, read the schema from a parquet file.
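The fallback order above can be sketched roughly as follows. This is a minimal
illustration, not Hudi code: resolve_table_schema and the three from_* source
functions are hypothetical names, and a schema is modeled as a plain string.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-ins for illustration; none of these names are Hudi APIs.
Schema = str
SchemaSource = Callable[[], Optional[Schema]]


def resolve_table_schema(sources: Sequence[SchemaSource]) -> Optional[Schema]:
    """Return the schema from the first source that yields one."""
    for source in sources:
        schema = source()
        if schema is not None:
            return schema
    return None


def from_active_timeline() -> Optional[Schema]:
    # Models the case where every schema-carrying instant has been archived.
    return None


def from_hoodie_properties() -> Optional[Schema]:
    # Models the table creation schema stored in hoodie.properties.
    return "s1"


def from_parquet_footer() -> Optional[Schema]:
    # Models the schema read from a data file.
    return "s1"


resolved = resolve_table_schema(
    [from_active_timeline, from_hoodie_properties, from_parquet_footer]
)
```

In this sketch the archived timeline is never consulted, so once the active
timeline has no schema the result is the creation schema.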
In fact, when the active timeline yields no schema, the resolver should search
the archived timeline for the table schema instead of falling back to the table
creation schema, which is the very first version. With concurrent schema
evolution, failing to read the latest table schema allows conflicting schema
evolutions to be committed.
Example:
* The initial table schema is s1.
* The latest table schema, as recorded by a completed instant in the archived
timeline, is s2, which adds one new column to s1.
* Another writer uses schema s3, which also adds one new nullable column to
s1. s3 is compatible with s1 but not with s2.
Currently nothing stops this writer from committing, because it is not aware of
s2 at all.
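The example can be made concrete with a toy model. The compatibility rule
below is an assumption chosen for illustration (every column of the table
schema must appear in the writer schema with the same type); it is not Hudi's
actual rule. The column names are hypothetical and nullability is left out.

```python
from typing import Dict

Schema = Dict[str, str]  # column name -> type name


def is_compatible(writer_schema: Schema, table_schema: Schema) -> bool:
    """Simplified rule (assumption): the writer schema must contain every
    table column with an identical type."""
    return all(writer_schema.get(col) == typ for col, typ in table_schema.items())


s1 = {"id": "long", "name": "string"}  # initial table schema
s2 = {**s1, "age": "int"}              # latest schema, only in the archived timeline
s3 = {**s1, "age": "string"}           # schema used by the concurrent writer

# The check the writer effectively performs today, against the resolved s1.
checked_against_s1 = is_compatible(s3, s1)
# The check it should perform, against the real latest schema s2.
checked_against_s2 = is_compatible(s3, s2)
```

Here the check against s1 passes while the check against s2 fails, which is the
conflict the writer never sees.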
Proposed fix:
Instead of having the resolver search the archived timeline, when the archiver
moves a batch of instants out of the active timeline, it can write a special
empty commit to the active timeline that records the latest table schema
indicated by the instants being moved out.
This guarantees that the active timeline always has information about the
table schema.
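A rough sketch of the proposed archiver behavior, under stated assumptions:
Instant, latest_schema and archive are hypothetical names, not Hudi classes,
and the marker reuses the timestamp of the newest archived instant so that it
sorts before the instants that stay active. The proposal itself does not say
how the marker should be ordered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instant:
    timestamp: int
    schema: Optional[str]           # None: this instant's metadata has no schema
    is_schema_marker: bool = False  # True for the special empty commit


def latest_schema(instants: List[Instant]) -> Optional[str]:
    """Schema of the newest instant that carries one, or None."""
    for instant in sorted(instants, key=lambda i: i.timestamp, reverse=True):
        if instant.schema is not None:
            return instant.schema
    return None


def archive(active: List[Instant], keep_latest: int) -> Tuple[List[Instant], List[Instant]]:
    """Move all but the newest keep_latest instants out of the active timeline,
    leaving behind an empty marker that records the latest archived schema."""
    ordered = sorted(active, key=lambda i: i.timestamp)
    cut = max(len(ordered) - keep_latest, 0)
    archived, remaining = ordered[:cut], ordered[cut:]
    schema = latest_schema(archived)
    if schema is not None:
        # Assumption: the marker takes the newest archived timestamp.
        marker = Instant(archived[-1].timestamp, schema, is_schema_marker=True)
        remaining = [marker] + remaining
    return remaining, archived


active = [Instant(1, "s1"), Instant(2, "s2"), Instant(3, None), Instant(4, None)]
remaining, archived = archive(active, keep_latest=2)
schema_after_archival = latest_schema(remaining)
```

Without the marker, the two instants left in the active timeline carry no
schema and a resolver would fall through to the creation schema. With it, s2
stays visible in the active timeline.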
--
This message was sent by Atlassian Jira
(v8.20.10#820010)