hudi-agent commented on code in PR #18375:
URL: https://github.com/apache/hudi/pull/18375#discussion_r3206083661
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -251,8 +251,13 @@ public static Collection<Pair<String, Long>> filterKeysFromFile(StoragePath file
       return Collections.emptyList();
     }
     log.info("Going to filter {} keys from file {}", candidateRecordKeys.size(), filePath);
+    // TODO: The AVRO fallback here ignores the merger's configured type. This pre-dates
+    // the Lance PR and is tracked at https://github.com/apache/hudi/issues/18496 — respect the
+    // merger's record type for all formats.
+    HoodieRecord.HoodieRecordType recordType = HoodieFileFormat.resolveRecordTypeForExtension(
+        FSUtils.getFileExtension(filePath.toString()), HoodieRecord.HoodieRecordType.AVRO);
Review Comment:
🤖 nit: `filePath` is a `StoragePath`, which already exposes
`getFileExtension()` (used the same way in
`HoodieReadHandle.createNewFileReader`). Could you call
`filePath.getFileExtension()` here for consistency, instead of bouncing through
`FSUtils.getFileExtension(filePath.toString())`?
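For illustration, a minimal stand-alone sketch of why the two calls are interchangeable (this is not Hudi code; `ExtensionDemo` and its method are hypothetical stand-ins — both `FSUtils.getFileExtension(path.toString())` and `StoragePath.getFileExtension()` amount to "substring from the last dot", so the shorter call is a drop-in replacement):

```java
// Hypothetical stand-in, not actual Hudi code: models the shared
// "extension = substring from last dot" behavior. Hudi's file-format
// extensions (e.g. ".parquet") include the leading dot.
public class ExtensionDemo {

  static String getFileExtension(String path) {
    int dot = path.lastIndexOf('.');
    // no dot means no extension; otherwise keep the dot, matching ".parquet"-style constants
    return dot < 0 ? "" : path.substring(dot);
  }

  public static void main(String[] args) {
    System.out.println(getFileExtension("/base/part-0001-abc.parquet")); // .parquet
    System.out.println(getFileExtension("/base/part-0002-def.lance"));   // .lance
  }
}
```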
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/UpdateProcessor.java:
##########
@@ -135,19 +139,59 @@ protected BufferedRecord<T> handleNonDeletes(BufferedRecord<T> previousRecord, B
     if (previousRecord == null) {
       // special case for payloads when there is no previous record
       HoodieSchema recordSchema = readerContext.getRecordContext().decodeAvroSchema(mergedRecord.getSchemaId());
-      GenericRecord record = readerContext.getRecordContext().convertToAvroRecord(mergedRecord.getRecord(), recordSchema);
-      HoodieAvroRecord hoodieRecord = new HoodieAvroRecord<>(null, HoodieRecordUtils.loadPayload(payloadClass, record, mergedRecord.getOrderingValue()));
-      try {
-        if (hoodieRecord.shouldIgnore(recordSchema, properties)) {
-          return null;
-        } else {
-          HoodieSchema readerSchema = readerContext.getSchemaHandler().getRequestedSchema();
-          // If the record schema is different from the reader schema, rewrite the record using the payload methods to ensure consistency with legacy writer paths
-          hoodieRecord.rewriteRecordWithNewSchema(recordSchema, properties, readerSchema).toIndexedRecord(readerSchema, properties)
-              .ifPresent(rewrittenRecord -> mergedRecord.replaceRecord(readerContext.getRecordContext().convertAvroRecord(rewrittenRecord.getData())));
+      GenericRecord originalAvro = mergedRecord.getOriginalAvroRecord();
+      Schema recordAvroSchema = recordSchema.toAvroSchema();
+
+      // When the merged record carries an originalAvroRecord (populated by extractDataFromRecord
+      // for ExpressionPayload in the COW write path via ExtractedData), the record is already in
+      // write-schema format with correctly evaluated expressions. Convert directly and skip the
+      // payload path.
+      //
+      // NOTE: this branch bypasses shouldIgnore. That is safe today because the only payload that
+      // populates originalAvroRecord is ExpressionPayload, which never returns shouldIgnore=true.
+      // If a future payload starts producing an originalAvroRecord, it must add a shouldIgnore
+      // check here.
+      if (originalAvro != null) {
Review Comment:
🤖 nit: `handleNonDeletes` is now ~50 lines with several nested branches and
three sizable comment blocks explaining contracts (originalAvro fast path,
shouldIgnore-bypass safety, schemaId-after-replaceRecord assumption). Could you
pull the two arms (`originalAvro != null` and the payload-evaluation branch)
into small private helpers (e.g. `applyOriginalAvroToMergedRecord(...)` and
`evaluatePayloadAndReplaceRecord(...)`) so the mainline reads as a clean
dispatch?
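As a rough sketch of the suggested shape (all names and types below are hypothetical stand-ins, not actual Hudi APIs): the mainline reduces to a two-arm dispatch, and each contract comment moves into the helper it governs.

```java
// Hypothetical sketch, not Hudi code: illustrates extracting the two arms of
// handleNonDeletes into named helpers so the mainline reads as a dispatch.
public class DispatchSketch {

  // stand-in for the merged record, carrying only what the dispatch inspects
  record Merged(Object originalAvro, Object payloadResult) {}

  static Object handleNonDeletes(Merged merged) {
    if (merged.originalAvro() != null) {
      // fast path: record is already in write-schema form; the
      // shouldIgnore-bypass contract is documented inside the helper,
      // next to the code it constrains
      return applyOriginalAvroToMergedRecord(merged);
    }
    return evaluatePayloadAndReplaceRecord(merged);
  }

  static Object applyOriginalAvroToMergedRecord(Merged merged) {
    // NOTE: bypasses shouldIgnore; safe while ExpressionPayload is the only
    // producer of originalAvroRecord (it never returns shouldIgnore=true)
    return merged.originalAvro();
  }

  static Object evaluatePayloadAndReplaceRecord(Merged merged) {
    // legacy payload-evaluation path (shouldIgnore check, schema rewrite, ...)
    return merged.payloadResult();
  }

  public static void main(String[] args) {
    System.out.println(handleNonDeletes(new Merged("avro", null)));      // avro
    System.out.println(handleNonDeletes(new Merged(null, "payload")));   // payload
  }
}
```

Each helper then owns one of the three comment blocks, which keeps the invariant documentation adjacent to the code it protects.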
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]