yihua commented on code in PR #13714:
URL: https://github.com/apache/hudi/pull/13714#discussion_r2272216581


##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/FileGroupReaderSchemaHandler.java:
##########
@@ -124,6 +132,18 @@ public DeleteContext getDeleteContext() {
     return deleteContext;
   }
 
+  public Pair<Schema, Map<String, String>> getRequiredSchemaForFileAndRenamedColumns(StoragePath path) {
+    if (internalSchema.isEmptySchema()) {
+      return Pair.of(requiredSchema, Collections.emptyMap());
+    }
+    long commitInstantTime = Long.parseLong(FSUtils.getCommitTime(path.getName()));
+    InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitInstantTime, metaClient);

Review Comment:
   Yes, the existing schema-evolution-on-read logic elsewhere follows the same pattern, so this is OK in the sense that it brings feature parity and does not introduce a regression.
   
   What I think makes more sense is to construct a schema history on the driver (schemas keyed by ranges of completion/instant time, e.g., schema1: ts1-ts100, schema2: ts101-ts1000, etc.) and distribute it to the executors. This schema history can be stored under `.hoodie`, so a single file read fetches the whole history and the executor does not pay the cost of scanning commit metadata or reading the schema from the file (assuming the file schema is based on the writer/table schema of the commit). This essentially needs a new schema system / abstraction, which is under the scope of RFC-88 @danny0405
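   To illustrate the idea, here is a minimal sketch of such a schema history: a driver-built map from the start instant time of each schema's validity range to that schema, which executors query with a single floor lookup instead of scanning commit metadata per file. All class and method names here are hypothetical, not actual Hudi APIs, and the schema is represented as a plain `String` for brevity.

   ```java
   import java.util.Map;
   import java.util.TreeMap;

   // Hypothetical sketch: schemas keyed by the instant time at which each
   // took effect, resolved by floor lookup on a file's commit instant time.
   public class SchemaHistory {
     // start-of-validity instant time -> schema for that range
     private final TreeMap<Long, String> historyByStartTime = new TreeMap<>();

     // Driver side: record that `schema` took effect at `startInstantTime`.
     public void addSchema(long startInstantTime, String schema) {
       historyByStartTime.put(startInstantTime, schema);
     }

     // Executor side: find the schema in effect at the file's commit time,
     // i.e., the schema with the greatest start time <= commitInstantTime.
     public String schemaAt(long commitInstantTime) {
       Map.Entry<Long, String> entry = historyByStartTime.floorEntry(commitInstantTime);
       if (entry == null) {
         throw new IllegalArgumentException(
             "No schema recorded at or before instant " + commitInstantTime);
       }
       return entry.getValue();
     }

     public static void main(String[] args) {
       SchemaHistory history = new SchemaHistory();
       history.addSchema(1L, "schema1");    // valid for ts1..ts100
       history.addSchema(101L, "schema2");  // valid from ts101 on
       System.out.println(history.schemaAt(50L));   // schema1
       System.out.println(history.schemaAt(500L));  // schema2
     }
   }
   ```

   Persisting this map once under `.hoodie` would let each executor load the entire history with one file read, matching the cost model described above.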



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
