hudi-agent commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3408525827
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -313,8 +313,10 @@ private static void addVectorColumnName(String s, int
start, int end, Set<String
private Schema avroSchema;
private HoodieSchemaType type;
- private transient List<HoodieSchemaField> fields;
- private transient Map<String, HoodieSchemaField> fieldMap;
+ // interned instances are shared across threads, so the lazily built caches
use double-checked
Review Comment:
🤖 nit: the comment says "initialization guarded by the monitor" which
implies a `synchronized` block, but neither `getFields()` nor `getFieldMap()`
actually acquires one — they use a benign race (volatile + unchecked assign),
exactly as the inline comments in those methods describe. Could you update this
field-level comment to say "benign race" to stay consistent with the method
bodies?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/avro/AvroRecordContext.java:
##########
@@ -70,7 +71,10 @@ public AvroRecordContext() {
public static Object getFieldValueFromIndexedRecord(
IndexedRecord record,
String fieldName) {
- HoodieSchema currentSchema =
HoodieSchema.fromAvroSchema(record.getSchema());
+ // Interning returns the canonical wrapper for this schema, whose lazily
built field list and
+ // field map survive across calls, so the per-record cost is a cache hit
instead of an
+ // O(schema width) wrapper rebuild.
+ HoodieSchema currentSchema =
HoodieSchemaCache.intern(HoodieSchema.fromAvroSchema(record.getSchema()));
Review Comment:
🤖 Worth noting that line 77 isn't calling `HoodieSchema.fromAvroSchema(...)`
anymore after this PR — it now calls
`AvroToHoodieSchemaCache.intern(record.getSchema())`, which is exactly the
Avro→HoodieSchema cache you're describing. The new `AvroToHoodieSchemaCache` is
a Caffeine `LoadingCache` with `weakKeys`, so steady-state per-record cost is
an identity-based cache hit (records sharing a `Schema` instance hit the same
entry); only the first call for a given Avro `Schema` runs `fromAvroSchema` +
`HoodieSchemaCache.intern`. Equal-but-distinct schema instances still converge
on one canonical `HoodieSchema` via the value-intern step. Is there a specific
call site you're seeing still rebuild per-record, or were you thinking of a
tighter encoding than the Caffeine lookup itself?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]