voonhous commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3426595575
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -348,16 +352,32 @@ private HoodieSchema(Schema avroSchema,
List<HoodieSchemaField> fields) {
this.fields = fields != null ? Collections.unmodifiableList(fields) : null;
}
+ // Avro schemas are interned by identity (records of one file share the same
Schema instance), so the
+ // per-record fromAvroSchema() call is a cache hit that reuses the canonical
HoodieSchema and its lazily
+ // built field list / field map instead of rebuilding an O(schema width)
wrapper. Misses convert and
+ // value-intern through HoodieSchemaCache so equal-but-distinct Avro schema
instances still converge on
+ // one canonical HoodieSchema. Global cache for the JVM lifecycle; weakKeys
lets dead schemas be GC'd.
+ private static final LoadingCache<Schema, HoodieSchema> AVRO_SCHEMA_CACHE =
+ Caffeine.newBuilder().weakKeys().maximumSize(1024)
+ .build(avroSchema ->
HoodieSchemaCache.intern(convertFromAvroSchema(avroSchema)));
+
/**
- * Factory method to create HoodieSchema from an Avro schema.
+ * Factory method to create a {@link HoodieSchema} from an Avro schema.
+ *
+ * <p>The result is interned: passing the same Avro {@link Schema} instance
(e.g. once per record)
+ * returns the canonical {@link HoodieSchema} rather than rebuilding a fresh
wrapper each call.
*
* @param avroSchema the Avro schema to wrap
- * @return new HoodieSchema instance
+ * @return canonical HoodieSchema instance, or {@code null} if {@code
avroSchema} is null
*/
public static HoodieSchema fromAvroSchema(Schema avroSchema) {
if (avroSchema == null) {
return null;
}
+ return AVRO_SCHEMA_CACHE.get(avroSchema);
Review Comment:
After we shifted the code to use `AVRO_SCHEMA_CACHE` in `HoodieSchema`, the
test:
`org.apache.hudi.TestHoodieSchemaUtils#testIllegalPromotionsBetweenPrimitives`
is failing with the error:
```log
769 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader
{"type":"record","name":"rec","fields":[{"name":"simpleField","type":"bytes"},{"name":"arrayField","type":{"type":"array","items":"bytes"}},{"name":"mapField","type":{"type":"map","values":"bytes"}},{"name":"nestedField","type":{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"bytes"}]}}]}
with writer
{"type":"record","name":"rec","fields":[{"name":"simpleField","type":"int"},{"name":"arrayField","type":{"type":"array","items":"int"}},{"name":"mapField","type":{"type":"map","values":"int"}},{"name":"nestedField","type":{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"int"}]}}]}
775 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader "bytes" with writer "int"
777 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader {"type":"array","items":"bytes"} with writer
{"type":"array","items":"int"}
777 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader "bytes" with writer "int"
777 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader {"type":"map","values":"bytes"} with writer
{"type":"map","values":"int"}
777 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader "bytes" with writer "int"
777 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader
{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"bytes"}]}
with writer
{"type":"record","name":"nestedField","fields":[{"name":"nested","type":"int"}]}
777 [main] DEBUG
org.apache.hudi.common.schema.HoodieSchemaCompatibilityChecker [] - Checking
compatibility of reader "bytes" with writer "int"
org.opentest4j.AssertionFailedError:
Expected :true
Actual :false
<Click to see difference>
```
## Why it breaks (it is not just code moving around)
The change flips `HoodieSchema.fromAvroSchema` from returning a **fresh**
wrapper on every call to returning a **shared, interned** instance.
`HoodieSchemaCompatibilityChecker` depends on that per-call freshness, so its
result changes even though the call sites look the same.
The memo key, `ReaderWriter`, compares the `HoodieSchema` instances by
**pointer identity**:
```java
public int hashCode() { return System.identityHashCode(mReader) ^
System.identityHashCode(mWriter); }
public boolean equals(Object obj) { ... return (this.mReader ==
that.mReader) && (this.mWriter == that.mWriter); }
```
The checker memoizes results in `mMemoizeMap` keyed by `ReaderWriter`: it
computes a `(reader, writer)` pair once, stores the result (with the field
location captured on that first visit), and returns that same result for any
later pair that is `==`.
The test builds a reader record with `bytes` at four field locations
(`simpleField`, `arrayField.element`, `mapField.value`, `nestedField.nested`)
and a writer with `int` at the same four. Each `bytes` vs `int` pair is a
`TYPE_MISMATCH` that must be reported with its own field path, and
`TestHoodieSchemaUtils.java:167` asserts all four paths appear in the exception
message.
**Before the change:** `fromAvroSchema` returned a fresh `HoodieSchema` per
call, so the `bytes` at `simpleField` and the `bytes` at `arrayField.element`
were distinct instances. Four distinct `(bytes, int)` identity pairs, so
`calculateCompatibility` runs four times, producing four incompatibilities (one
per field path); the message lists all four and the test passes.
**After the change:** `fromAvroSchema` interns, so all four `bytes` collapse
to one canonical instance and all four `int` to one. Now all four locations are
the same `(canonicalBytes, canonicalInt)` identity pair. The first one
(`rec.simpleField`) is computed and memoized; the other three hit the memo and
reuse that result without recording their own locations. The message ends up
with only `rec.simpleField`, so the assertion fails for the other three.
That is also why the debug log shows four "reader bytes with writer int"
lines while the message is short: `log.debug` runs at the top of
`getCompatibility` on every call, but `calculateCompatibility` (which records
the location) only runs once; the other three short-circuit on the memo.
**In short:** the checker's recursion memo assumes a distinct schema
instance per occurrence (that is how it tells genuine recursion from a sibling
subschema of the same shape). Interning inside `fromAvroSchema` makes
structurally-equal sibling subschemas share identity, so the checker conflates
them and under-reports. The interning is only needed on the per-record read
path, not this one-shot compatibility check, so keeping it at the explicit
`AvroToHoodieSchemaCache.intern` call sites avoids the problem.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]