danny0405 commented on code in PR #18885:
URL: https://github.com/apache/hudi/pull/18885#discussion_r3426330609
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##########
@@ -94,12 +96,45 @@ public static HoodieCommitMetadata buildMetadata(List<HoodieWriteStat> writeStat
if (extraMetadata.isPresent()) {
extraMetadata.get().forEach(commitMetadata::addMetadata);
}
- commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, (schemaToStoreInCommit == null || schemaToStoreInCommit.equals(NULL_SCHEMA_STR))
- ? "" : schemaToStoreInCommit);
+ commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY,
+ sanitizeSchemaForCommitMetadata(schemaToStoreInCommit));
commitMetadata.setOperationType(operationType);
return commitMetadata;
}
+ /**
+ * Returns the value to persist under {@link HoodieCommitMetadata#SCHEMA_KEY}.
+ * The schema stored in commit extraMetadata must be the user/write schema and
+ * must NOT contain Hudi meta fields ({@code _hoodie_commit_time}, etc.). If
+ * the caller-provided schema has meta fields (e.g. because some upstream code
+ * mutated the in-memory write config schema with reader-schema-with-meta-fields,
+ * or because a previously-polluted SCHEMA_KEY was read back into the config),
+ * this strips them so the persisted schema is always clean. When no meta fields
+ * are present, the input string is returned unchanged.
+ */
+ public static String sanitizeSchemaForCommitMetadata(String schemaToStoreInCommit) {
Review Comment:
> because some upstream code
> * mutated the in-memory write config schema with reader-schema-with-meta-fields,
> * or because a previously-polluted SCHEMA_KEY was read back into the config
This looks more like mistaken usage by users than a problem in Hudi. If users
have already declared metadata fields as part of their schema, shouldn't we keep
the commit metadata consistent with that?
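For reference, the behaviour described in the Javadoc above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code in this PR: the schema is modelled as a plain list of field names rather than an Avro schema string, and the class name `SchemaSanitizerSketch`, the constant `META_FIELDS`, and the method `stripMetaFields` are hypothetical names introduced here.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SchemaSanitizerSketch {

    // The five standard Hudi meta columns. This constant is hypothetical and
    // exists only for this sketch; it is not a Hudi API.
    static final Set<String> META_FIELDS = Set.of(
        "_hoodie_commit_time",
        "_hoodie_commit_seqno",
        "_hoodie_record_key",
        "_hoodie_partition_path",
        "_hoodie_file_name");

    // Assumption: a schema is represented by its ordered field names.
    // Returns the same list instance when no meta fields are present,
    // mirroring the "returned unchanged" wording in the Javadoc.
    public static List<String> stripMetaFields(List<String> fieldNames) {
        boolean hasMeta = fieldNames.stream().anyMatch(META_FIELDS::contains);
        if (!hasMeta) {
            return fieldNames;
        }
        return fieldNames.stream()
            .filter(name -> !META_FIELDS.contains(name))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> polluted = List.of("_hoodie_commit_time", "id", "name");
        // Meta column is dropped; user fields keep their order.
        System.out.println(stripMetaFields(polluted));
    }
}
```

Under this model, the question raised above amounts to whether a user who deliberately lists `_hoodie_commit_time` among their own fields should see it silently dropped from the persisted schema.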