hudi-agent commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3426817984


##########
hudi-common/src/main/java/org/apache/hudi/common/schema/AvroToHoodieSchemaCache.java:
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.schema;
+
+import com.github.benmanes.caffeine.cache.Caffeine;
+import com.github.benmanes.caffeine.cache.LoadingCache;
+import org.apache.avro.Schema;
+
+/**
+ * A global cache mapping Avro {@link Schema} instances to their canonical 
{@link HoodieSchema}.
+ *
+ * <p>This is an Avro-schema-keyed view onto {@link HoodieSchemaCache} for 
per-record call sites:
+ * {@code weakKeys} gives identity-based lookups (records of one file share 
the same {@link Schema}
+ * instance), so the hot path is a single cache hit with no wrapper allocation 
or type dispatch.
+ * Misses convert and then value-intern through {@link HoodieSchemaCache}, so 
equal but distinct Avro
+ * schema instances still converge on one canonical {@link HoodieSchema}.
+ *
+ * <p>This is a global cache which works for a JVM lifecycle.
+ */
+public class AvroToHoodieSchemaCache {
+
+  private static final LoadingCache<Schema, HoodieSchema> AVRO_SCHEMA_CACHE =

Review Comment:
   🤖 nit: could you rename this to `HoodieAvroSchemaCache`? Every other class 
in the `org.apache.hudi.common.schema` package follows the `Hoodie`-prefix 
convention (`HoodieSchema`, `HoodieSchemaCache`, `HoodieSchemaField`, …), and a 
future contributor searching for this class would probably look for `Hoodie*` 
first.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##########
@@ -507,9 +507,11 @@ public byte[] getBytes(Schema schema) throws IOException {
       output.writeInt(records.size());
 
       // 3. Write the records
+      // schema is loop-invariant; wrap it once instead of rebuilding the 
HoodieSchema per record
+      HoodieSchema hoodieSchema = HoodieSchema.fromAvroSchema(schema);

Review Comment:
   🤖 nit: have you considered using `AvroToHoodieSchemaCache.intern(schema)` 
here instead of `HoodieSchema.fromAvroSchema`? Every other call site touched in 
this PR switches to `intern`, so this one stands out as inconsistent and a 
reader would be left wondering if the omission is intentional.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/avro/AvroRecordContext.java:
##########
@@ -70,7 +71,10 @@ public AvroRecordContext() {
   public static Object getFieldValueFromIndexedRecord(
       IndexedRecord record,
       String fieldName) {
-    HoodieSchema currentSchema = 
HoodieSchema.fromAvroSchema(record.getSchema());
+    // Interning returns the canonical wrapper for this schema, whose lazily 
built field list and
+    // field map survive across calls, so the per-record cost is a cache hit 
instead of an
+    // O(schema width) wrapper rebuild.
+    HoodieSchema currentSchema = 
HoodieSchemaCache.intern(HoodieSchema.fromAvroSchema(record.getSchema()));

Review Comment:
   🤖 Worth noting that the line you're pointing at IS the cache — 
`AvroToHoodieSchemaCache.intern(...)` is a Caffeine `LoadingCache<Schema, 
HoodieSchema>` with `weakKeys()`, so on the hot path it's a single 
identity-keyed lookup with no wrapper allocation. The 
`HoodieSchema.fromAvroSchema(...)` call only runs inside the loader on first 
sight of each distinct `Schema` instance and then value-interns through 
`HoodieSchemaCache`, so per-record cost is effectively a cache hit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to