Re: [PR] perf(common): Avoid per-record HoodieSchema rebuilds on Avro read/merge paths [hudi]

via GitHub Tue, 16 Jun 2026 23:46:35 -0700


hudi-agent commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3426123830



##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -313,8 +315,10 @@ private static void addVectorColumnName(String s, int 
start, int end, Set<String
 
   private Schema avroSchema;
   private HoodieSchemaType type;
-  private transient List<HoodieSchemaField> fields;
-  private transient Map<String, HoodieSchemaField> fieldMap;
+  // interned instances are shared across threads, so the lazily built caches 
use double-checked

Review Comment:
   🤖 nit: this comment says "double-checked locking ... initialization guarded 
by the monitor," but `getFields()`/`getFieldMap()` have no `synchronized` block 
— it's a benign-race pattern with safe publication via `volatile` (as the 
inline comments below correctly describe). Could you reword to match, e.g. 
"lazily computed; benign race on init, `volatile` ensures safe publication"? 
Otherwise future readers will go looking for a monitor that isn't there.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##########
@@ -22,7 +22,7 @@ import 
org.apache.hudi.HoodieSchemaConversionUtils.{convertHoodieSchemaToDataTyp
 import org.apache.hudi.SparkAdapterSupport.sparkAdapter
 import org.apache.hudi.avro.HoodieAvroUtils
 import org.apache.hudi.common.model.{DefaultHoodieRecordPayload, 
HoodiePayloadProps, HoodieRecord, HoodieRecordPayload, 
OverwriteWithLatestAvroPayload}
-import org.apache.hudi.common.schema.{HoodieSchema, HoodieSchemaUtils}
+import org.apache.hudi.common.schema.{AvroToHoodieSchemaCache, HoodieSchema, 
HoodieSchemaUtils}

Review Comment:
   🤖 Same issue as `SparkFileFormatInternalRecordContext.scala` — 
`AvroToHoodieSchemaCache` was removed in the last commit but the import here 
and the 7 call sites (lines 119, 148, 209, 234, 246, and the two on 318) still 
reference it. They should all become `HoodieSchema.fromAvroSchema(...)` now 
that interning lives inside that factory.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRecordContext.scala:
##########
@@ -47,7 +47,7 @@ trait SparkFileFormatInternalRecordContext extends 
BaseSparkInternalRecordContex
    * @return An [[InternalRow]].
    */
   override def convertAvroRecord(avroRecord: IndexedRecord): InternalRow = {
-    val schema = HoodieSchema.fromAvroSchema(avroRecord.getSchema)
+    val schema = AvroToHoodieSchemaCache.intern(avroRecord.getSchema)

Review Comment:
   🤖 nit: with `HoodieSchema.fromAvroSchema` now interning internally, having 
call sites split between `AvroToHoodieSchemaCache.intern(...)` and 
`HoodieSchema.fromAvroSchema(...)` (e.g. still used in `HoodieAvroDataBlock`) 
gives us two equivalent entry points and invites "which one do I call?" 
confusion. Could we settle on a single canonical API across the codebase?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRecordContext.scala:
##########
@@ -21,7 +21,7 @@ package org.apache.hudi
 
 import org.apache.avro.generic.{GenericRecord, IndexedRecord}
 import org.apache.hudi.common.engine.RecordContext
-import org.apache.hudi.common.schema.HoodieSchema
+import org.apache.hudi.common.schema.{AvroToHoodieSchemaCache, HoodieSchema}

Review Comment:
   🤖 Looks like the last commit that folded the cache into 
`HoodieSchema.fromAvroSchema` and deleted `AvroToHoodieSchemaCache.java` missed 
updating this file — both the import on line 24 and the call on line 50 still 
reference the now-deleted class, which should break the Spark client build. 
Could you swap these to `HoodieSchema.fromAvroSchema(...)` (which now interns 
internally)?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] perf(common): Avoid per-record HoodieSchema rebuilds on Avro read/merge paths [hudi]

Reply via email to