hudi-agent commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3408525827


##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -313,8 +313,10 @@ private static void addVectorColumnName(String s, int 
start, int end, Set<String
 
   private Schema avroSchema;
   private HoodieSchemaType type;
-  private transient List<HoodieSchemaField> fields;
-  private transient Map<String, HoodieSchemaField> fieldMap;
+  // interned instances are shared across threads, so the lazily built caches 
use double-checked

Review Comment:
   🤖 nit: the comment says "initialization guarded by the monitor" which 
implies a `synchronized` block, but neither `getFields()` nor `getFieldMap()` 
actually acquires one — they use a benign race (volatile + unchecked assign), 
exactly as the inline comments in those methods describe. Could you update this 
field-level comment to say "benign race" to stay consistent with the method 
bodies?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/avro/AvroRecordContext.java:
##########
@@ -70,7 +71,10 @@ public AvroRecordContext() {
   public static Object getFieldValueFromIndexedRecord(
       IndexedRecord record,
       String fieldName) {
-    HoodieSchema currentSchema = 
HoodieSchema.fromAvroSchema(record.getSchema());
+    // Interning returns the canonical wrapper for this schema, whose lazily 
built field list and
+    // field map survive across calls, so the per-record cost is a cache hit 
instead of an
+    // O(schema width) wrapper rebuild.
+    HoodieSchema currentSchema = 
HoodieSchemaCache.intern(HoodieSchema.fromAvroSchema(record.getSchema()));

Review Comment:
   🤖 Worth noting that line 77 isn't calling `HoodieSchema.fromAvroSchema(...)` 
anymore after this PR — it now calls 
`AvroToHoodieSchemaCache.intern(record.getSchema())`, which is exactly the 
Avro→HoodieSchema cache you're describing. The new `AvroToHoodieSchemaCache` is 
a Caffeine `LoadingCache` with `weakKeys`, so steady-state per-record cost is 
an identity-based cache hit (records sharing a `Schema` instance hit the same 
entry); only the first call for a given Avro `Schema` runs `fromAvroSchema` + 
`HoodieSchemaCache.intern`. Equal-but-distinct schema instances still converge 
on one canonical `HoodieSchema` via the value-intern step. Is there a specific 
call site you're seeing still rebuild per-record, or were you thinking of a 
tighter encoding than the Caffeine lookup itself?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to