Davis Zhang created HUDI-8439:
---------------------------------

             Summary: Table schema resolver issue
                 Key: HUDI-8439
                 URL: https://issues.apache.org/jira/browse/HUDI-8439
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Davis Zhang


Related Jira: https://issues.apache.org/jira/browse/HUDI-8219

 

Today, the table schema resolver resolves the table schema as follows:
 * Read the schema from the commit metadata of the latest completed instant 
(commit, delta commit, compaction, or replace commit) in the active timeline.
 * If no such schema exists, try to read the table creation schema from 
hoodie.properties.
 * If that does not exist either, read the schema from a parquet file.
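
The fallback order above can be modeled as a chain of schema sources tried in 
turn. This is only a minimal sketch, not Hudi's actual resolver: the function 
names and the use of plain strings as schemas are hypothetical stand-ins chosen 
for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical: each source is a zero-argument callable that returns a schema
# (modeled here as a plain string) or None when it has nothing to offer.
SchemaSource = Callable[[], Optional[str]]


def resolve_table_schema(sources: Sequence[SchemaSource]) -> Optional[str]:
    """Return the schema from the first source that yields one."""
    for source in sources:
        schema = source()
        if schema is not None:
            return schema
    return None


# Stand-ins for the three steps described above.
def from_active_timeline() -> Optional[str]:
    return None  # every schema-bearing instant has already been archived


def from_hoodie_properties() -> Optional[str]:
    return "s1"  # table creation schema


def from_parquet_file() -> Optional[str]:
    return "s1"


resolved = resolve_table_schema(
    [from_active_timeline, from_hoodie_properties, from_parquet_file]
)
```

In this model, once the active timeline yields nothing, the resolver falls 
straight back to the creation schema, regardless of what archived instants 
recorded.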

 

In fact, when the active timeline yields no schema, we should search the 
archived timeline for the table schema before falling back to the initial 
table creation schema. With concurrent schema evolution, failing to read the 
latest table schema would lead to conflicting schema evolutions being 
allowed.

 

Example:
 * The initial table schema is s1.
 * The latest table schema, as recorded by a completed instant in the archived 
timeline, is s2, which has 1 new column compared to s1.
 * Another writer uses schema s3, which also has 1 new nullable column compared 
to s1. s3 is compatible with s1 but not compatible with s2.

Currently, nothing stops this writer from making a commit, because it is not 
aware of s2 at all.
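
The scenario can be reproduced with a toy model. The compatibility rule here 
is an assumption made for illustration (a writer schema is accepted only if it 
keeps every existing column with the same type), not Hudi's real check, and 
the column names are invented.

```python
from typing import Dict

# Hypothetical model: a schema is a mapping of column name to type name.
Schema = Dict[str, str]


def is_compatible(table_schema: Schema, writer_schema: Schema) -> bool:
    """Assumed rule: the writer must keep every existing column unchanged."""
    return all(
        writer_schema.get(column) == column_type
        for column, column_type in table_schema.items()
    )


s1 = {"id": "long"}                  # initial table schema
s2 = {"id": "long", "a": "int"}      # latest schema, only in archived timeline
s3 = {"id": "long", "b": "string"}   # schema used by the concurrent writer

checked_against_s1 = is_compatible(s1, s3)  # what the resolver checks today
checked_against_s2 = is_compatible(s2, s3)  # what it should check
```

Under this assumed rule the check against s1 passes and the check against s2 
fails, so a resolver that only sees s1 lets the conflicting write through.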

 

Proposed fix:

Instead of searching the archived timeline at read time, the archiver can, 
whenever it moves a batch of instants out of the active timeline, write a 
special empty commit to the active timeline that records the latest table 
schema indicated by the instants being moved out.

 

This guarantees that the active timeline always has information about the 
table schema.
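
A rough sketch of the proposed archiver behavior follows. Everything in it is 
hypothetical: the Instant class, the integer timestamps, and in particular the 
choice to give the marker the timestamp of the newest archived instant (so it 
never outranks instants that remain active), which this issue does not 
specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instant:
    timestamp: int
    schema: Optional[str] = None    # schema from commit metadata, if any
    is_schema_marker: bool = False  # True for the special empty commit


def latest_schema(instants: List[Instant]) -> Optional[str]:
    """Return the schema of the newest instant that carries one."""
    for instant in sorted(instants, key=lambda i: i.timestamp, reverse=True):
        if instant.schema is not None:
            return instant.schema
    return None


def archive(active: List[Instant], archived: List[Instant], count: int) -> None:
    """Move the `count` oldest instants out, leaving a schema marker behind."""
    active.sort(key=lambda i: i.timestamp)
    moved, remaining = active[:count], active[count:]
    archived.extend(moved)
    schema = latest_schema(moved)
    if schema is not None:
        # Assumption: reuse the newest moved timestamp so the marker sorts
        # before every instant that stays in the active timeline.
        marker = Instant(moved[-1].timestamp, schema, is_schema_marker=True)
        remaining.insert(0, marker)
    active[:] = remaining


active = [Instant(1, "s1"), Instant(2, "s2"), Instant(3)]
archived: List[Instant] = []
archive(active, archived, 2)
schema_after_archiving = latest_schema(active)
```

After archiving the two oldest instants, the active timeline holds the marker 
plus the remaining instant, and a resolver that only reads the active timeline 
still finds s2.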



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
