hudi-agent commented on code in PR #18885:
URL: https://github.com/apache/hudi/pull/18885#discussion_r3431468646


##########
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##########
@@ -94,12 +95,32 @@ public static HoodieCommitMetadata 
buildMetadata(List<HoodieWriteStat> writeStat
     if (extraMetadata.isPresent()) {
       extraMetadata.get().forEach(commitMetadata::addMetadata);
     }
-    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, 
(schemaToStoreInCommit == null || schemaToStoreInCommit.equals(NULL_SCHEMA_STR))
-        ? "" : schemaToStoreInCommit);
+    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY,
+        sanitizeSchemaForCommitMetadata(schemaToStoreInCommit));
     commitMetadata.setOperationType(operationType);
     return commitMetadata;
   }
 
+  /**
+   * Returns the value to persist under {@link 
HoodieCommitMetadata#SCHEMA_KEY}.
+   * The schema stored in commit extraMetadata must be the user/write schema 
and
+   * must NOT contain Hudi meta fields ({@code _hoodie_commit_time}, etc.). If
+   * the caller-provided schema has meta fields (e.g. because some upstream 
code
+   * mutated the in-memory write config schema with 
reader-schema-with-meta-fields,
+   * or because a previously-polluted SCHEMA_KEY was read back into the 
config),
+   * this strips them so the persisted schema is always clean.
+   */
+  public static String sanitizeSchemaForCommitMetadata(String 
schemaToStoreInCommit) {
+    if (schemaToStoreInCommit == null || 
schemaToStoreInCommit.equals(NULL_SCHEMA_STR)) {
+      return "";
+    }
+    HoodieSchema schema = HoodieSchema.parse(schemaToStoreInCommit);

Review Comment:
   🤖 I think the new sanitizer regresses on empty-string input: 
`HoodieSchema.parse("")` fails its `ValidationUtils.checkArgument(jsonSchema != 
null && !jsonSchema.trim().isEmpty(), ...)` guard and throws 
`IllegalArgumentException`. The old ternary accepted `""` and stored it as-is, 
and several existing call sites pass empty string literally (e.g. 
`TestHoodieTableFSViewWithClustering.java:146`, 
`TestHoodieTableFileSystemView.java:2068/2240/2263/2282`). Could you extend the 
early-return to also short-circuit on empty/whitespace, e.g. `if 
(schemaToStoreInCommit == null || schemaToStoreInCommit.isEmpty() || 
schemaToStoreInCommit.equals(NULL_SCHEMA_STR)) return "";` before calling 
`parse(...)`?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##########
@@ -94,12 +95,32 @@ public static HoodieCommitMetadata 
buildMetadata(List<HoodieWriteStat> writeStat
     if (extraMetadata.isPresent()) {
       extraMetadata.get().forEach(commitMetadata::addMetadata);
     }
-    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, 
(schemaToStoreInCommit == null || schemaToStoreInCommit.equals(NULL_SCHEMA_STR))
-        ? "" : schemaToStoreInCommit);
+    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY,
+        sanitizeSchemaForCommitMetadata(schemaToStoreInCommit));
     commitMetadata.setOperationType(operationType);
     return commitMetadata;
   }
 
+  /**
+   * Returns the value to persist under {@link 
HoodieCommitMetadata#SCHEMA_KEY}.
+   * The schema stored in commit extraMetadata must be the user/write schema 
and
+   * must NOT contain Hudi meta fields ({@code _hoodie_commit_time}, etc.). If
+   * the caller-provided schema has meta fields (e.g. because some upstream 
code
+   * mutated the in-memory write config schema with 
reader-schema-with-meta-fields,
+   * or because a previously-polluted SCHEMA_KEY was read back into the 
config),
+   * this strips them so the persisted schema is always clean.
+   */
+  public static String sanitizeSchemaForCommitMetadata(String 
schemaToStoreInCommit) {

Review Comment:
   🤖 nit: `sanitize` is vague here — it doesn't tell the reader what's being 
removed. Something like `stripMetaFieldsForCommitPersistence` or 
`removeMetaFieldsFromSchema` would make the method's exact transformation 
self-documenting and easier to discover.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to