[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #5830: [HUDI-3981][RFC-33] Flink engine support for comprehensive schema evolution

GitBox Tue, 15 Nov 2022 00:53:18 -0800


xiarixiaoyao commented on code in PR #5830:
URL: https://github.com/apache/hudi/pull/5830#discussion_r1022504653



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseMergeHelper.java:
##########
@@ -130,4 +145,48 @@ protected Void getResult() {
       return null;
     }
   }
+
+  protected Iterator<GenericRecord> getRecordIterator(
+      HoodieTable<T, ?, ?, ?> table,
+      HoodieMergeHandle<T, ?, ?, ?> mergeHandle,
+      HoodieBaseFile baseFile,
+      HoodieFileReader<GenericRecord> reader,
+      Schema readSchema) throws IOException {
+    Option<InternalSchema> querySchemaOpt = SerDeHelper.fromJson(table.getConfig().getInternalSchema());
+    if (!querySchemaOpt.isPresent()) {
+      querySchemaOpt = new TableSchemaResolver(table.getMetaClient()).getTableInternalSchemaFromCommitMetadata();
+    }
+    boolean needToReWriteRecord = false;
+    Map<String, String> renameCols = new HashMap<>();
+    // TODO support bootstrap
+    if (querySchemaOpt.isPresent() && !baseFile.getBootstrapBaseFile().isPresent()) {

Review Comment:
   @trushev  
   Can we avoid moving this code snippet? I don't think Flink schema evolution needs to modify this code.
   https://github.com/apache/hudi/pull/6358 and https://github.com/apache/hudi/pull/7183 will optimize this code.
   
   @danny0405  
   We need to check schema evolution for each base file.
   Once multiple column changes have been made, different base files may have different schemas, so we cannot read these files directly with the current table schema; doing so throws an exception.
   
   For example, tableA has the schema (a int, b string, c double) and contains three files: f1, f2, f3.
   
   We drop column c from tableA and add a new column d, then update tableA. The update only rewrites f2 and f3; f1 is not touched.
   The schemas are now:
   ```
   schema1 (tableA): a int, b string, d long
   schema2 (f2, f3): a int, b string, d long
   schema3 (f1):     a int, b string, c double
   ```
   We should not use schema1 to read f1.
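
To make the per-file check concrete, here is a minimal, self-contained sketch in plain Java. It is not Hudi's implementation: `PerFileSchemaSketch`, `schema`, and `rewrite` are hypothetical names, schemas are modelled as simple column-name-to-type maps, and renames and type promotions are ignored. It only illustrates comparing each base file's own schema with the table schema and rewriting records when they differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerFileSchemaSketch {

  // Hypothetical helper: builds an ordered column-name -> type map from name/type pairs.
  static Map<String, String> schema(String... nameTypePairs) {
    Map<String, String> s = new LinkedHashMap<>();
    for (int i = 0; i < nameTypePairs.length; i += 2) {
      s.put(nameTypePairs[i], nameTypePairs[i + 1]);
    }
    return s;
  }

  // Reshapes a record read with its file schema into the query (table) schema:
  // columns missing from the query schema are dropped, and columns missing
  // from the file schema are filled with null.
  static Map<String, Object> rewrite(Map<String, Object> record,
                                     Map<String, String> fileSchema,
                                     Map<String, String> querySchema) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String col : querySchema.keySet()) {
      out.put(col, fileSchema.containsKey(col) ? record.get(col) : null);
    }
    return out;
  }

  public static void main(String[] args) {
    // schema1: current table schema after dropping c and adding d.
    Map<String, String> tableSchema = schema("a", "int", "b", "string", "d", "long");
    // schema3: f1 was never rewritten, so it still carries the old schema.
    Map<String, String> f1Schema = schema("a", "int", "b", "string", "c", "double");

    Map<String, Object> f1Record = new LinkedHashMap<>();
    f1Record.put("a", 1);
    f1Record.put("b", "x");
    f1Record.put("c", 2.5);

    // The decision is made per base file, not once per table.
    boolean needToRewrite = !f1Schema.equals(tableSchema);
    Map<String, Object> result =
        needToRewrite ? rewrite(f1Record, f1Schema, tableSchema) : f1Record;
    System.out.println("needToRewrite=" + needToRewrite + ", record=" + result);
  }
}
```

For f2 and f3 the file schema equals the table schema, so no rewrite is needed; only f1 goes through the rewrite path.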
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
