voonhous commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3426595575


##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -348,16 +352,32 @@ private HoodieSchema(Schema avroSchema, 
List<HoodieSchemaField> fields) {
     this.fields = fields != null ? Collections.unmodifiableList(fields) : null;
   }
 
+  // Avro schemas are interned by identity (records of one file share the same 
Schema instance), so the
+  // per-record fromAvroSchema() call is a cache hit that reuses the canonical 
HoodieSchema and its lazily
+  // built field list / field map instead of rebuilding an O(schema width) 
wrapper. Misses convert and
+  // value-intern through HoodieSchemaCache so equal-but-distinct Avro schema 
instances still converge on
+  // one canonical HoodieSchema. Global cache for the JVM lifecycle; weakKeys 
lets dead schemas be GC'd.
+  private static final LoadingCache<Schema, HoodieSchema> AVRO_SCHEMA_CACHE =
+      Caffeine.newBuilder().weakKeys().maximumSize(1024)
+          .build(avroSchema -> 
HoodieSchemaCache.intern(convertFromAvroSchema(avroSchema)));
+
   /**
-   * Factory method to create HoodieSchema from an Avro schema.
+   * Factory method to create a {@link HoodieSchema} from an Avro schema.
+   *
+   * <p>The result is interned: passing the same Avro {@link Schema} instance 
(e.g. once per record)
+   * returns the canonical {@link HoodieSchema} rather than rebuilding a fresh 
wrapper each call.
    *
    * @param avroSchema the Avro schema to wrap
-   * @return new HoodieSchema instance
+   * @return canonical HoodieSchema instance, or {@code null} if {@code 
avroSchema} is null
    */
   public static HoodieSchema fromAvroSchema(Schema avroSchema) {
     if (avroSchema == null) {
       return null;
     }
+    return AVRO_SCHEMA_CACHE.get(avroSchema);

Review Comment:
   After we shifted the code to use `AVRO_SCHEMA_CACHE` in `HoodieSchema`, the 
test:
   
`org.apache.hudi.TestHoodieSchemaUtils#testIllegalPromotionsBetweenPrimitives` 
is failing with the error:
   
   ```log
   769  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader 
{"type":"record","name":"rec","fields":[{"name":"simpleField","type":"bytes"},{"name":"arrayField","type":{"type":"array","items":"bytes"}},{"name":"mapField","type":{"type":"map","values":"bytes"}},{"name":"nestedField","type":{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"bytes"}]}}]}
 with writer 
{"type":"record","name":"rec","fields":[{"name":"simpleField","type":"int"},{"name":"arrayField","type":{"type":"array","items":"int"}},{"name":"mapField","type":{"type":"map","values":"int"}},{"name":"nestedField","type":{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"int"}]}}]}
   775  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader "bytes" with writer "int"
   777  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader {"type":"array","items":"bytes"} with writer 
{"type":"array","items":"int"}
   777  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader "bytes" with writer "int"
   777  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader {"type":"map","values":"bytes"} with writer 
{"type":"map","values":"int"}
   777  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader "bytes" with writer "int"
   777  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader 
{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"bytes"}]}
 with writer 
{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"int"}]}
   777  [main] DEBUG 
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking 
compatibility of reader "bytes" with writer "int"
   
   org.opentest4j.AssertionFailedError: 
   Expected :true
   Actual   :false
   <Click to see difference>
   ```
   
   ## Why it breaks (it is not just code moving around)
   
   The change flips `HoodieSchema.fromAvroSchema` from returning a **fresh** 
wrapper on every call to returning a **shared, interned** instance. 
`HoodieSchemaCompatibilityChecker` depends on that per-call freshness, so its 
result changes even though the call sites look the same.
   
   The memo key, `ReaderWriter`, compares the `HoodieSchema` instances by 
**pointer identity**:
   
   ```java
   public int hashCode() { return System.identityHashCode(mReader) ^ 
System.identityHashCode(mWriter); }
   public boolean equals(Object obj) { ... return (this.mReader == 
that.mReader) && (this.mWriter == that.mWriter); }
   ```
   
   The checker memoizes results in `mMemoizeMap` keyed by `ReaderWriter`: it 
computes a `(reader, writer)` pair once, stores the result (with the field 
location captured on that first visit), and returns that same result for any 
later pair that is `==`.
   
   The test builds a reader record with `bytes` at four field locations 
(`simpleField`, `arrayField.element`, `mapField.value`, `nestedField.nested`) 
and a writer with `int` at the same four. Each `bytes` vs `int` pair is a 
`TYPE_MISMATCH` that must be reported with its own field path, and 
`TestHoodieSchemaUtils.java:167` asserts all four paths appear in the exception 
message.
   
   **Before the change:** `fromAvroSchema` returned a fresh `HoodieSchema` per 
call, so the `bytes` at `simpleField` and the `bytes` at `arrayField.element` 
were distinct instances. Four distinct `(bytes, int)` identity pairs, so 
`calculateCompatibility` runs four times, producing four incompatibilities (one 
per field path); the message lists all four and the test passes.
   
   **After the change:** `fromAvroSchema` interns, so all four `bytes` collapse 
to one canonical instance and all four `int` to one. Now all four locations are 
the same `(canonicalBytes, canonicalInt)` identity pair. The first one 
(`rec.simpleField`) is computed and memoized; the other three hit the memo and 
reuse that result without recording their own locations. The message ends up 
with only `rec.simpleField`, so the assertion fails for the other three.
   
   That is also why the debug log shows four "reader bytes with writer int" 
lines while the message is short: `log.debug` runs at the top of 
`getCompatibility` on every call, but `calculateCompatibility` (which records 
the location) only runs once; the other three short-circuit on the memo.
   
   **In short:** the checker's recursion memo assumes a distinct schema 
instance per occurrence (that is how it tells genuine recursion from a sibling 
subschema of the same shape). Interning inside `fromAvroSchema` makes 
structurally-equal sibling subschemas share identity, so the checker conflates 
them and under-reports. The interning is only needed on the per-record read 
path, not this one-shot compatibility check, so keeping it at the explicit 
`AvroToHoodieSchemaCache.intern` call sites avoids the problem.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to