Davis-Zhang-Onehouse commented on code in PR #12242:
URL: https://github.com/apache/hudi/pull/12242#discussion_r1841230992


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##########
@@ -164,6 +163,34 @@ class ExpressionPayload(@transient record: GenericRecord,
     }
   }
 
+  private def doRecordMerge(incomingRecord: GenericRecord,
+                            existingRecord: IndexedRecord,
+                            schema: Schema,
+                            properties: Properties): HOption[IndexedRecord] = {
+    val tablePayloadClass = 
properties.getProperty(PAYLOAD_ORIGINAL_AVRO_PAYLOAD)
+    if 
(tablePayloadClass.equals(classOf[OverwriteWithLatestAvroPayload].getName)) {
+      HOption.of(incomingRecord)
+    } else if 
(tablePayloadClass.equals(classOf[DefaultHoodieRecordPayload].getName)) {
+      if (needUpdatingPersistedRecord(existingRecord, incomingRecord, 
properties)) {
+        HOption.of(incomingRecord)
+      } else {
+        HOption.of(existingRecord)
+      }
+    } else {
+      val orderingField = ConfigUtils.getOrderingField(properties)
+      if (StringUtils.isNullOrEmpty(orderingField)) {
+        HOption.of(incomingRecord)
+      } else {
+        val consistentLogicalTimestampEnabled = 
properties.getProperty(KeyGeneratorOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.key,
+          
KeyGeneratorOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED.defaultValue).toBoolean
+        val incomingRecordPayload = 
HoodieRecordUtils.loadPayload(tablePayloadClass, incomingRecord,

Review Comment:
   I was thinking of caching to avoid the penalty. Here is my thinking process:
   
   1. payload object can be stateful. 
org.apache.hudi.common.model.DefaultHoodieRecordPayload#isDeleteComputed is an 
example. For any customized third party payload class implementation, caching 
must be banned because of that. You never know what states has changed after an 
API call.
   
   2. On one hand, the design of payload class has a tight coupling between the 
raw avro record and "payload handler" defining the strategy on how raw data is 
manipulated. On the other hand, the expression payload of MIT does not know 
what's the actual strategy to use at construction time, it require a loose 
coupling between the raw data and the strategy which is not possible without a 
larger scope of class redesign and code refactor.
   
   3. From the perspective of making incremental progress, this PR is a net win 
and does 0 harm functional-wise. It ensures functional correctness where it 
honors the table config specified and for default payload and overwrite payload 
there is no perf penalty.
   
   I'm proposing to have a follow up for any larger code refactor and this PR 
focuses only on the correctness which is a 0 to 1 thing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to