Davis-Zhang-Onehouse commented on code in PR #12776:
URL: https://github.com/apache/hudi/pull/12776#discussion_r1945591684
##########
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/realtime/TestHoodieRealtimeRecordReader.java:
##########
@@ -619,21 +622,19 @@ public void
testSchemaEvolutionAndRollbackBlockInLastLogFile(ExternalSpillableMa
@Test
public void testSchemaEvolution() throws Exception {
- ExternalSpillableMap.DiskMapType diskMapType =
ExternalSpillableMap.DiskMapType.BITCASK;
- boolean isCompressionEnabled = true;
+ HoodieTableMetaClient metaClient =
HoodieTestUtils.init(basePath.toString(), HoodieTableType.MERGE_ON_READ);
+ HoodieTestTable table = HoodieTestTable.of(metaClient);
Review Comment:
In RFC 82 there are 2 critical parts:
- For all writers (excluding table services), at validation phase they
resolve the writer schema and table schema at that time and write the resolved
"Table schema" in the schema field of commit metadata.
- For anyone who would like to know what is the table schema given the
timeline, we need to focus on commits from writers who writes table schema in
their commit metadata, that's why we need extra filtering over those instants.
This PR address the second point as today's table schema resolver does not
work properly in multi-writer scenario and
https://github.com/apache/hudi/pull/12781 addresses the first point.
> don't want to introduce too much overhead for it.
Yes we are on the same page and the code change keeps this in mind. Let me
explain:
- Do we need a new table schema resolver: The answer is "Yes" - The old
table schema resolver does not deliver the guarantee that it returns the "table
schema" in multi-writer scenario, it just returns whatever the writer schema in
some recent commit metadata it found. If we want to implement RFC 82, the table
schema resolver is a must have.
- Can we limit the usage of the new schema resolver to just relevant code
path: "We can but why maintain 2 sets of code if 1 completely out perform the
other in both performance, test coverage and correctness"? We have clean Java
CI as I completely deprecated the old one, I could not justify keeping the old
code.
> we are planning to introduce new schema abstractions
There is no contradiction - if the table schema resolver class needs to be
deprecated when new things comes, let's go ahead and deprecate. Before that, we
have to live with this class and make necessary fixes to deliver RFC 82.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]