Re: [PR] [HUDI-8219] Introduce new Table schema resolver [hudi]

via GitHub Thu, 06 Feb 2025 15:32:22 -0800


Davis-Zhang-Onehouse commented on code in PR #12776:
URL: https://github.com/apache/hudi/pull/12776#discussion_r1945591684



##########
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/realtime/TestHoodieRealtimeRecordReader.java:
##########
@@ -619,21 +622,19 @@ public void 
testSchemaEvolutionAndRollbackBlockInLastLogFile(ExternalSpillableMa
 
   @Test
   public void testSchemaEvolution() throws Exception {
-    ExternalSpillableMap.DiskMapType diskMapType = 
ExternalSpillableMap.DiskMapType.BITCASK;
-    boolean isCompressionEnabled = true;
+    HoodieTableMetaClient metaClient = 
HoodieTestUtils.init(basePath.toString(), HoodieTableType.MERGE_ON_READ);
+    HoodieTestTable table = HoodieTestTable.of(metaClient);

Review Comment:
   In RFC 82 there are 2 critical parts:
   - For all writers (excluding table services), at validation phase they 
resolve the writer schema and table schema at that time and write the resolved 
"Table schema" in the schema field of commit metadata.
   - For anyone who would like to know what is the table schema given the 
timeline, we need to focus on commits from writers who writes table schema in 
their commit metadata, that's why we need extra filtering over those instants.
   
   This PR address the second point as today's table schema resolver does not 
work properly in multi-writer scenario and 
https://github.com/apache/hudi/pull/12781 addresses the first point.
   
   
   > don't want to introduce too much overhead for it.
   
   Yes we are on the same page and the code change keeps this in mind. Let me 
explain:
   - Do we need a new table schema resolver: The answer is "Yes" - The old 
table schema resolver does not deliver the guarantee that it returns the "table 
schema" in multi-writer scenario, it just returns whatever the writer schema in 
some recent commit metadata it found. If we want to implement RFC 82, the table 
schema resolver is a must have.
   - Can we limit the usage of the new schema resolver to just relevant code 
path: "We can but why maintain 2 sets of code if 1 completely out perform the 
other in both performance, test coverage and correctness"? We have clean Java 
CI as I completely deprecated the old one, I could not justify keeping the old 
code.
   
   > we are planning to introduce new schema abstractions
   
   There is no contradiction - if the table schema resolver class needs to be 
deprecated when new things comes, let's go ahead and deprecate. Before that, we 
have to live with this class and make necessary fixes to deliver RFC 82.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-8219] Introduce new Table schema resolver [hudi]

Reply via email to