hudi-agent commented on code in PR #18967:
URL: https://github.com/apache/hudi/pull/18967#discussion_r3426123830
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -313,8 +315,10 @@ private static void addVectorColumnName(String s, int
start, int end, Set<String
private Schema avroSchema;
private HoodieSchemaType type;
- private transient List<HoodieSchemaField> fields;
- private transient Map<String, HoodieSchemaField> fieldMap;
+ // interned instances are shared across threads, so the lazily built caches
use double-checked
Review Comment:
🤖 nit: this comment says "double-checked locking ... initialization guarded
by the monitor," but `getFields()`/`getFieldMap()` have no `synchronized` block
— it's a benign-race pattern with safe publication via `volatile` (as the
inline comments below correctly describe). Could you reword to match, e.g.
"lazily computed; benign race on init, `volatile` ensures safe publication"?
Otherwise future readers will go looking for a monitor that isn't there.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##########
@@ -22,7 +22,7 @@ import
org.apache.hudi.HoodieSchemaConversionUtils.{convertHoodieSchemaToDataTyp
import org.apache.hudi.SparkAdapterSupport.sparkAdapter
import org.apache.hudi.avro.HoodieAvroUtils
import org.apache.hudi.common.model.{DefaultHoodieRecordPayload,
HoodiePayloadProps, HoodieRecord, HoodieRecordPayload,
OverwriteWithLatestAvroPayload}
-import org.apache.hudi.common.schema.{HoodieSchema, HoodieSchemaUtils}
+import org.apache.hudi.common.schema.{AvroToHoodieSchemaCache, HoodieSchema,
HoodieSchemaUtils}
Review Comment:
🤖 Same issue as `SparkFileFormatInternalRecordContext.scala` —
`AvroToHoodieSchemaCache` was removed in the last commit but the import here
and the 7 call sites (lines 119, 148, 209, 234, 246, and the two on 318) still
reference it. They should all become `HoodieSchema.fromAvroSchema(...)` now
that interning lives inside that factory.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRecordContext.scala:
##########
@@ -47,7 +47,7 @@ trait SparkFileFormatInternalRecordContext extends
BaseSparkInternalRecordContex
* @return An [[InternalRow]].
*/
override def convertAvroRecord(avroRecord: IndexedRecord): InternalRow = {
- val schema = HoodieSchema.fromAvroSchema(avroRecord.getSchema)
+ val schema = AvroToHoodieSchemaCache.intern(avroRecord.getSchema)
Review Comment:
🤖 nit: with `HoodieSchema.fromAvroSchema` now interning internally, having
call sites split between `AvroToHoodieSchemaCache.intern(...)` and
`HoodieSchema.fromAvroSchema(...)` (e.g. still used in `HoodieAvroDataBlock`)
gives us two equivalent entry points and invites "which one do I call?"
confusion. Could we settle on a single canonical API across the codebase?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRecordContext.scala:
##########
@@ -21,7 +21,7 @@ package org.apache.hudi
import org.apache.avro.generic.{GenericRecord, IndexedRecord}
import org.apache.hudi.common.engine.RecordContext
-import org.apache.hudi.common.schema.HoodieSchema
+import org.apache.hudi.common.schema.{AvroToHoodieSchemaCache, HoodieSchema}
Review Comment:
🤖 Looks like the last commit that folded the cache into
`HoodieSchema.fromAvroSchema` and deleted `AvroToHoodieSchemaCache.java` missed
updating this file — both the import on line 24 and the call on line 50 still
reference the now-deleted class, which should break the Spark client build.
Could you swap these to `HoodieSchema.fromAvroSchema(...)` (which now interns
internally)?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]