hudi-agent commented on code in PR #18885:
URL: https://github.com/apache/hudi/pull/18885#discussion_r3457415171


##########
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##########
@@ -94,12 +95,32 @@ public static HoodieCommitMetadata 
buildMetadata(List<HoodieWriteStat> writeStat
     if (extraMetadata.isPresent()) {
       extraMetadata.get().forEach(commitMetadata::addMetadata);
     }
-    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, 
(schemaToStoreInCommit == null || schemaToStoreInCommit.equals(NULL_SCHEMA_STR))
-        ? "" : schemaToStoreInCommit);
+    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY,
+        sanitizeSchemaForCommitMetadata(schemaToStoreInCommit));
     commitMetadata.setOperationType(operationType);
     return commitMetadata;
   }
 
+  /**
+   * Returns the value to persist under {@link 
HoodieCommitMetadata#SCHEMA_KEY}.
+   * The schema stored in commit extraMetadata must be the user/write schema 
and
+   * must NOT contain Hudi meta fields ({@code _hoodie_commit_time}, etc.). If
+   * the caller-provided schema has meta fields (e.g. because some upstream 
code
+   * mutated the in-memory write config schema with 
reader-schema-with-meta-fields,
+   * or because a previously-polluted SCHEMA_KEY was read back into the 
config),
+   * this strips them so the persisted schema is always clean.
+   */
+  public static String sanitizeSchemaForCommitMetadata(String 
schemaToStoreInCommit) {
+    if (schemaToStoreInCommit == null || 
schemaToStoreInCommit.equals(NULL_SCHEMA_STR)) {
+      return "";
+    }
+    HoodieSchema schema = HoodieSchema.parse(schemaToStoreInCommit);

Review Comment:
   🤖 I can confirm this — `HoodieSchema.Parser.parse` (HoodieSchema.java:1663) 
does `checkArgument(jsonSchema != null && !jsonSchema.trim().isEmpty(), ...)`, 
so feeding `""` through the sanitizer throws `IllegalArgumentException`. This 
is the same defect as the null/NULL_SCHEMA regression noted above, so one 
early-return short-circuiting null, empty/whitespace, and NULL_SCHEMA_STR 
before `parse(...)` would resolve both at once.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##########
@@ -272,6 +272,9 @@ protected void finalizeWrite(String instantTime, 
List<HoodieWriteStat> stats, Ho
 
   /**
    * By default, return the writer schema in Write Config for storing in 
commit.
+   * The returned schema is sanitized of Hudi meta fields by

Review Comment:
   🤖 nit: "The returned schema is sanitized" reads as though this method 
performs the stripping itself. Could you reword to make it clear the 
sanitization is downstream, e.g. "Subclasses overriding this need not strip 
Hudi meta fields — {@link CommitUtils#sanitizeSchemaForCommitMetadata(String)} 
handles that centrally when commit metadata is built."?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##########
@@ -94,8 +92,7 @@ public static HoodieCommitMetadata 
buildMetadata(List<HoodieWriteStat> writeStat
     if (extraMetadata.isPresent()) {
       extraMetadata.get().forEach(commitMetadata::addMetadata);
     }
-    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, 
(schemaToStoreInCommit == null || schemaToStoreInCommit.equals(NULL_SCHEMA_STR))
-        ? "" : schemaToStoreInCommit);
+    commitMetadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, 
schemaToStoreInCommit);

Review Comment:
   🤖 Confirmed against master — the conversion lives at 
`CommitUtils.buildMetadata` (lines 97-98): `(schemaToStoreInCommit == null || 
schemaToStoreInCommit.equals(NULL_SCHEMA_STR)) ? "" : schemaToStoreInCommit`. 
Dropping it means `"\"null\""` gets persisted verbatim and the resolver no 
longer throws. This is the same root cause as the empty-string issue flagged 
below, so a single early-return guarding null / empty-or-whitespace / 
NULL_SCHEMA_STR before parsing would cover both paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to