kazdy commented on code in PR #7998:
URL: https://github.com/apache/hudi/pull/7998#discussion_r1118447028


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -1063,7 +1063,9 @@ object HoodieSparkSqlWriter {
     val recordType = config.getRecordMerger.getRecordType
 
     val shouldCombine = parameters(INSERT_DROP_DUPS.key()).toBoolean ||
-      operation.equals(WriteOperationType.UPSERT) ||
+      (operation.equals(WriteOperationType.UPSERT) &&
+        parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key(),
+          HoodieWriteConfig.COMBINE_BEFORE_UPSERT.defaultValue()).toBoolean) ||

Review Comment:
   Does this mean that users should never set COMBINE_BEFORE_UPSERT for a CoW 
table? I thought this config is something users are allowed to disable 
deliberately.
   
   If combining before upsert is always required for CoW, how does Flink 
deduplicate records for tables with no precombine field?
   
   My thinking was:
   1. user creates a table without a precombine field
   2. wants to do an update
   3. can't, because they get an exception about the missing precombine field
   
   To allow this flow:
   1. user creates a table with no precombine field
   2. Hudi internally knows no precombine field was defined and disables 
COMBINE_BEFORE_UPSERT
   3. Hudi does not try to precombine on upsert; it is the user's 
responsibility to deduplicate the incoming dataset
   
   I thought this was a clean way to do it.
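
   To make the behavior under discussion concrete, here is a minimal, 
self-contained sketch (not Hudi's actual API) of the `shouldCombine` condition 
as changed in the diff. The config key strings mirror Hudi's 
`INSERT_DROP_DUPS` and `COMBINE_BEFORE_UPSERT` settings, but the object and 
method names here are hypothetical, purely for illustration:

   ```scala
   // Hypothetical sketch of the shouldCombine logic from the diff.
   // With the change, a user (or Hudi itself, for tables without a
   // precombine field) can disable combining on upsert by setting
   // hoodie.combine.before.upsert=false.
   object ShouldCombineSketch {
     sealed trait WriteOperationType
     case object Upsert extends WriteOperationType
     case object Insert extends WriteOperationType

     def shouldCombine(operation: WriteOperationType,
                       parameters: Map[String, String]): Boolean = {
       val insertDropDups = parameters
         .getOrElse("hoodie.datasource.write.insert.drop.duplicates", "false")
         .toBoolean
       // Default "true" mirrors COMBINE_BEFORE_UPSERT's default value.
       val combineBeforeUpsert = parameters
         .getOrElse("hoodie.combine.before.upsert", "true")
         .toBoolean
       // Before the change: insertDropDups || operation == Upsert
       // After the change, COMBINE_BEFORE_UPSERT can opt out of combining:
       insertDropDups || (operation == Upsert && combineBeforeUpsert)
     }
   }
   ```

   Under this reading, step 2 of the proposed flow would amount to Hudi 
passing `hoodie.combine.before.upsert=false` internally when the table has no 
precombine field, so the upsert path skips the precombine step entirely.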



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
