hudi-agent commented on code in PR #18375:
URL: https://github.com/apache/hudi/pull/18375#discussion_r3198725216


##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/UpdateProcessor.java:
##########
@@ -135,19 +139,59 @@ protected BufferedRecord<T> handleNonDeletes(BufferedRecord<T> previousRecord, B
       if (previousRecord == null) {
         // special case for payloads when there is no previous record
         HoodieSchema recordSchema = readerContext.getRecordContext().decodeAvroSchema(mergedRecord.getSchemaId());
-        GenericRecord record = readerContext.getRecordContext().convertToAvroRecord(mergedRecord.getRecord(), recordSchema);
-        HoodieAvroRecord hoodieRecord = new HoodieAvroRecord<>(null, HoodieRecordUtils.loadPayload(payloadClass, record, mergedRecord.getOrderingValue()));
-        try {
-          if (hoodieRecord.shouldIgnore(recordSchema, properties)) {
-            return null;
-          } else {
-            HoodieSchema readerSchema = readerContext.getSchemaHandler().getRequestedSchema();
-            // If the record schema is different from the reader schema, rewrite the record using the payload methods to ensure consistency with legacy writer paths
-            hoodieRecord.rewriteRecordWithNewSchema(recordSchema, properties, readerSchema).toIndexedRecord(readerSchema, properties)
-                .ifPresent(rewrittenRecord -> mergedRecord.replaceRecord(readerContext.getRecordContext().convertAvroRecord(rewrittenRecord.getData())));
+        GenericRecord originalAvro = mergedRecord.getOriginalAvroRecord();

Review Comment:
   🤖 nit: with the new originalAvro branch, `handleNonDeletes` is now ~50 lines 
and four levels deep (if → else → try → if/else). Could you extract the else 
branch into a helper like `rewriteAndReplaceFromPayload(mergedRecord, 
recordSchema)` that returns the boolean "emit/suppress" decision? That would 
let the top of the method read as a clean two-way switch on `originalAvro != 
null`.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
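One possible shape for the suggested extraction, as a minimal self-contained sketch. The stub `MergedRecord` class and `String` schema stand in for the real Hudi types, and `rewriteAndReplaceFromPayload` with its boolean "emit/suppress" contract is the reviewer's proposal, not existing API:

```java
// Illustrative stand-in for BufferedRecord<T>; only the fields this sketch needs.
class MergedRecord {
  Object originalAvro; // set when extraction went through an Avro payload
  Object record;       // engine-native row

  void replaceRecord(Object r) {
    this.record = r;
  }
}

class UpdateSketch {
  // Returns true when the record should be emitted, false to suppress it.
  // In the real code this is where the payload-based shouldIgnore check and
  // the rewriteRecordWithNewSchema call would live.
  static boolean rewriteAndReplaceFromPayload(MergedRecord merged, String recordSchema) {
    if (merged.record == null) {
      // stand-in for hoodieRecord.shouldIgnore(...) returning true
      return false;
    }
    merged.replaceRecord(merged.record); // stand-in for the schema rewrite
    return true;
  }

  // The top of handleNonDeletes becomes a clean two-way switch on originalAvro != null.
  static MergedRecord handleNonDeletes(MergedRecord merged, String recordSchema) {
    if (merged.originalAvro != null) {
      // fast path: decode payload bytes directly from the original Avro record
      return merged;
    }
    return rewriteAndReplaceFromPayload(merged, recordSchema) ? merged : null;
  }
}
```

The helper keeps the try/if/else nesting out of the caller, so the emit/suppress decision reads at a single level of indentation.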



##########
hudi-common/src/main/java/org/apache/hudi/common/engine/ExtractedData.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.engine;
+
+import org.apache.avro.generic.GenericRecord;
+
+import javax.annotation.Nullable;
+
+/**
+ * Result of {@link RecordContext#extractDataFromRecord} carrying the engine-native row plus,
+ * when extraction went through an Avro payload (e.g. {@code ExpressionPayload} in SQL MERGE INTO),
+ * the original Avro record produced by {@code HoodieRecord#toIndexedRecord}.
+ *
+ * <p>The original Avro record is propagated to {@link org.apache.hudi.common.table.read.BufferedRecord}
+ * so downstream code (e.g. {@code UpdateProcessor}) can decode the payload bytes with the correct
+ * source schema instead of re-serializing the engine-native row through a possibly-mismatched schema.
+ */
+public final class ExtractedData<T> {
+
+  private final T data;
+  @Nullable
+  private final GenericRecord originalAvroRecord;
+
+  private ExtractedData(T data, @Nullable GenericRecord originalAvroRecord) {
+    this.data = data;
+    this.originalAvroRecord = originalAvroRecord;
+  }
+
+  public static <T> ExtractedData<T> of(T data) {
+    return new ExtractedData<>(data, null);
+  }
+
+  public static <T> ExtractedData<T> of(T data, @Nullable GenericRecord originalAvroRecord) {

Review Comment:
   🤖 nit: the two static factories `of(T)` and `of(T, GenericRecord)` differ only by the optional avro hint — could you drop the single-arg overload and just have callers pass `null`, or keep `of(T, GenericRecord)` and give the no-hint variant a clearly distinct name, e.g. `withoutAvroHint(T)`? As written, picking between them at call sites is a small distraction since `of(x)` and `of(x, null)` produce identical objects.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
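A self-contained sketch of the renamed-factory variant. `Object` stands in for Avro's `GenericRecord` so the example compiles on its own, and `withoutAvroHint` is the hypothetical name floated above, not existing API:

```java
// Simplified mirror of ExtractedData<T> with the proposed factory split.
final class ExtractedDataSketch<T> {
  private final T data;
  private final Object originalAvroRecord; // nullable Avro hint

  private ExtractedDataSketch(T data, Object originalAvroRecord) {
    this.data = data;
    this.originalAvroRecord = originalAvroRecord;
  }

  // The common case gets a name that says there is no Avro hint.
  static <T> ExtractedDataSketch<T> withoutAvroHint(T data) {
    return new ExtractedDataSketch<>(data, null);
  }

  // The two-arg factory keeps the hint explicit; callers no longer choose
  // between of(x) and of(x, null) producing identical objects.
  static <T> ExtractedDataSketch<T> of(T data, Object originalAvroRecord) {
    return new ExtractedDataSketch<>(data, originalAvroRecord);
  }

  T getData() {
    return data;
  }

  Object getOriginalAvroRecord() {
    return originalAvroRecord;
  }
}
```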



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieReadHandle.java:
##########
@@ -68,14 +70,18 @@ protected HoodieBaseFile getLatestBaseFile() {
   }
 
   protected HoodieFileReader createNewFileReader() throws IOException {
-    return HoodieIOFactory.getIOFactory(hoodieTable.getStorage())
-        .getReaderFactory(this.config.getRecordMerger().getRecordType())
-        .getFileReader(config, getLatestBaseFile().getStoragePath());
+    return createNewFileReader(getLatestBaseFile());
   }
 
   protected HoodieFileReader createNewFileReader(HoodieBaseFile hoodieBaseFile) throws IOException {
+    String extension = hoodieBaseFile.getStoragePath().getFileExtension();
+    HoodieFileFormat format = HoodieFileFormat.fromFileExtensionOrNull(extension);
+    HoodieRecord.HoodieRecordType mergerRecordType = this.config.getRecordMerger().getRecordType();
+    HoodieRecord.HoodieRecordType recordType = format != null

Review Comment:
   🤖 nit: this same `extension → fromFileExtensionOrNull → resolveRecordType, 
fall back to merger type` snippet now appears here, in 
`HoodieIndexUtils.filterKeysFromFile`, and in 
`HoodieFileWriterFactory.getFileWriter`. Could you fold it into a static helper 
on `HoodieFileFormat` (e.g. `resolveRecordTypeForExtension(String, 
HoodieRecordType fallback)`)? The three call sites would become a one-liner and 
stay in lockstep.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
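A self-contained sketch of the shared helper. The enums are trimmed stand-ins for the real `HoodieFileFormat`/`HoodieRecordType` (fewer members, illustrative per-format mapping), and `resolveRecordTypeForExtension` is the reviewer's proposed method, not existing API:

```java
enum RecordType { AVRO, SPARK }

enum FileFormat {
  PARQUET(".parquet", RecordType.SPARK),
  HFILE(".hfile", RecordType.AVRO);

  final String extension;
  final RecordType recordType;

  FileFormat(String extension, RecordType recordType) {
    this.extension = extension;
    this.recordType = recordType;
  }

  // Mirrors fromFileExtensionOrNull: unknown extensions resolve to null.
  static FileFormat fromFileExtensionOrNull(String ext) {
    for (FileFormat f : values()) {
      if (f.extension.equals(ext)) {
        return f;
      }
    }
    return null;
  }

  // The one helper all three call sites would share: resolve the record type
  // from the file extension, falling back to the merger's record type.
  static RecordType resolveRecordTypeForExtension(String ext, RecordType fallback) {
    FileFormat format = fromFileExtensionOrNull(ext);
    return format != null ? format.recordType : fallback;
  }
}
```

Each call site then collapses to `resolveRecordTypeForExtension(extension, mergerRecordType)`, so the fallback rule cannot drift between copies.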



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -327,7 +327,11 @@ public class HoodieWriteConfig extends HoodieConfig {
   public static final ConfigProperty<HoodieFileFormat> BASE_FILE_FORMAT = ConfigProperty
       .key("hoodie.base.file.format")
       .defaultValue(HoodieFileFormat.PARQUET)
-      .withValidValues(HoodieFileFormat.PARQUET.name(), HoodieFileFormat.ORC.name(), HoodieFileFormat.HFILE.name())
+      .withValidValues(
+          HoodieFileFormat.PARQUET.name(),
+          HoodieFileFormat.ORC.name(),
+          HoodieFileFormat.HFILE.name(),
+          HoodieFileFormat.LANCE.name())

Review Comment:
   _⚠️ Potential issue_ | _🟠 Major_
   
   **LANCE is now accepted here, but `getMaxFileSize()` still rejects it.**
   
   Lines 2369-2379 in this same class still only handle `PARQUET`, `ORC`, and 
`HFILE` before throwing. After advertising `HoodieFileFormat.LANCE` as a valid 
base format, any path that derives max file size from 
`config.getBaseFileFormat()` can now fail at runtime for Lance tables. Please 
either add a `LANCE` branch there or block `LANCE` until its sizing behavior is 
defined.
   
   <details>
   <summary>🤖 Prompt for AI Agents</summary>
   
   ```
   Verify each finding against the current code and only fix it if needed.
   
   In
   
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java`
   around lines 322 - 326, The config now advertises HoodieFileFormat.LANCE as a
   valid base format but getMaxFileSize (in class HoodieWriteConfig) still only
   handles PARQUET, ORC and HFILE and will throw for LANCE; update 
getMaxFileSize
   to either add a LANCE branch with the appropriate sizing policy (e.g., derive
   from config.getBaseFileFormat() or a new lance-specific config key) or
   explicitly reject/validate HoodieFileFormat.LANCE earlier so it cannot be 
used
   until sizing behavior is defined; locate the getMaxFileSize method and the
   usages of config.getBaseFileFormat() and implement the LANCE case or add a 
clear
   validation/error message preventing LANCE selection.
   ```
   
   </details>
   
   — *CodeRabbit* ([original](https://github.com/hudi-agent/hudi/pull/3#discussion_r3108757025)) (source:comment#3108757025)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
