yihua commented on code in PR #13714:
URL: https://github.com/apache/hudi/pull/13714#discussion_r2272216581
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/FileGroupReaderSchemaHandler.java:
##########
@@ -124,6 +132,18 @@ public DeleteContext getDeleteContext() {
return deleteContext;
}
+  public Pair<Schema, Map<String, String>> getRequiredSchemaForFileAndRenamedColumns(StoragePath path) {
+    if (internalSchema.isEmptySchema()) {
+      return Pair.of(requiredSchema, Collections.emptyMap());
+    }
+    long commitInstantTime = Long.parseLong(FSUtils.getCommitTime(path.getName()));
+    InternalSchema fileSchema = InternalSchemaCache.searchSchemaAndCache(commitInstantTime, metaClient);
Review Comment:
Yes, the existing schema-evolution-on-read logic in other places follows the
same pattern, so this is OK in the sense that it brings feature parity and
does not introduce a regression.
I think what makes more sense is to have a schema history (schemas keyed by
ranges of completion/instant time, e.g., schema1: ts1-ts100, schema2:
ts101-ts1000, etc.) constructed on the driver and distributed to the
executors. This schema history can be stored under `.hoodie`, so a single
file read fetches the whole schema history and the executor does not pay the
cost of scanning commit metadata or reading the schema from the file
(assuming that the file schema is based on the writer/table schema of the
commit). This essentially needs a new schema system / abstraction, which is
under the scope of RFC-88 @danny0405
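The schema-history idea above could be sketched roughly as follows. This is a
hypothetical illustration, not Hudi API: the `SchemaHistory` class and its
methods are invented names, and plain strings stand in for Hudi's
`InternalSchema`. The point is that once the history is built on the driver
and shipped to executors, resolving a file's schema is an in-memory floor
lookup by commit instant time, with no commit-metadata scan or file read.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a schema history: each entry maps the first commit
// instant time at which a schema became active to that schema, e.g.
// schema1 active from ts1, schema2 active from ts101, and so on.
class SchemaHistory {
  // Keyed by the start instant time of each schema's validity range.
  private final TreeMap<Long, String> schemasByStartInstant = new TreeMap<>();

  void addSchema(long startInstantTime, String schema) {
    schemasByStartInstant.put(startInstantTime, schema);
  }

  // Resolve the schema for a file from its commit instant time: the entry
  // with the greatest start instant <= the commit instant (floor lookup,
  // O(log n), purely in memory on the executor).
  String schemaAt(long commitInstantTime) {
    Map.Entry<Long, String> entry = schemasByStartInstant.floorEntry(commitInstantTime);
    if (entry == null) {
      throw new IllegalArgumentException(
          "No schema recorded at or before instant " + commitInstantTime);
    }
    return entry.getValue();
  }
}
```

Under this sketch, the driver would build the history once (e.g., from a
single file persisted under `.hoodie`) and broadcast it, so executors never
touch the timeline or data files just to discover a schema.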
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]