voonhous commented on code in PR #14340:
URL: https://github.com/apache/hudi/pull/14340#discussion_r2599292947


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##########
@@ -410,7 +410,10 @@ public Option<IndexedRecord> getInsertValue(Schema schema, 
Properties properties
       return Option.empty();
     }
 
-    if (schema == null || HOODIE_METADATA_SCHEMA == schema) {
+    // TODO: feature(schema): HoodieSchema change, we removed caching in a few 
areas, during the migration of Avro.Schema -> HoodieSchema.

Review Comment:
   Sorry, what I meant in the TODO is that caching was implicitly removed when 
introducing `HoodieSchemaCache`.
   
    **Before the migration:**
     - When schemas were read from log file headers, they went through 
AvroSchemaCache.intern(Schema)
     - AvroSchemaCache is keyed by Schema (using Schema.equals()), so even if a 
new Schema object was parsed, intern() would return the same cached reference 
as HOODIE_METADATA_SCHEMA
   
   **After the migration:**
     - Schemas are parsed into HoodieSchema via HoodieSchema.parse()
     - HoodieSchemaCache.intern(HoodieSchema) caches the HoodieSchema wrapper
     - When `.toAvroSchema()` is called, it returns the underlying Avro Schema 
which was created by avroParser.parse()   
     - this is a **NEW** object, not the same reference as 
HOODIE_METADATA_SCHEMA
   
   During this transitionary PRs, we have the following mismatch:
     - AvroSchemaCache ensured Avro Schema reference reuse
     - HoodieSchemaCache ensures HoodieSchema wrapper reuse, but the wrapped 
Avro Schema is still a different object
   
   
   The `.equals()` is a temporary bridge during the migration. 
   
   To clarify this, that once HoodieRecordPayload.getInsertValue() accepts 
HoodieSchema instead of Schema, the == comparison can be restored, do you want 
me to add that into the comment?
   
   Addressing this together with the PR will increase the SLOC further.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to