kazdy commented on code in PR #7998:
URL: https://github.com/apache/hudi/pull/7998#discussion_r1118447028


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -1063,7 +1063,9 @@ object HoodieSparkSqlWriter {
     val recordType = config.getRecordMerger.getRecordType
 
     val shouldCombine = parameters(INSERT_DROP_DUPS.key()).toBoolean ||
-      operation.equals(WriteOperationType.UPSERT) ||
+      (operation.equals(WriteOperationType.UPSERT) &&
+        parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key(),
+          HoodieWriteConfig.COMBINE_BEFORE_UPSERT.defaultValue()).toBoolean) ||

Review Comment:
   Does this mean that users should never set COMBINE_BEFORE_UPSERT for a CoW 
table? I thought this config is something users are allowed to disable 
deliberately. At least I was surprised it does not work as described in the docs.
   
   If combine before upsert is always required for CoW, how does Flink deduplicate 
records for tables without a precombine field?
   
   My thinking was:
   1. user creates a table without a precombine field
   2. wants to do an update
   3. can't, because they get an exception about the missing precombine field
   
   To allow this flow:
   1. user creates a table with no precombine field
   2. Hudi internally knows no precombine field was defined and disables 
COMBINE_BEFORE_UPSERT
   3. Hudi does not try to precombine on upsert; it's the user's responsibility to 
deduplicate the incoming dataset
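   The step-2 fallback could be sketched roughly like this (names like `effectiveCombineBeforeUpsert` are illustrative, not an actual Hudi API):

   ```scala
   // Hypothetical sketch: when the table has no precombine field, fall back to
   // disabling combine-before-upsert instead of failing later with an exception.
   def effectiveCombineBeforeUpsert(
       precombineField: Option[String],
       userSetting: Option[String]): Boolean =
     precombineField match {
       case None    => false // nothing to combine on, so don't try
       case Some(_) => userSetting.map(_.toBoolean).getOrElse(true) // default stays true
     }
   ```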
   
   But for this to work, this change would need to be done first.
   
   I thought this was a clean way to do it, because the error is thrown from here:
   
https://github.com/apache/hudi/blob/cde8d4ffa1cae261731d94c2a0117ece6473a882/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L1096-L1098
   
   after the condition on the `shouldCombine` val.
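   In simplified, self-contained form (stand-in types, not Hudi's exact code), the guard the diff introduces looks like this: `shouldCombine` is only true for UPSERT when combine-before-upsert is enabled, so the "missing precombine field" check downstream is skipped when the user disables it.

   ```scala
   // Stand-in for Hudi's WriteOperationType, just for this sketch.
   object WriteOperationType extends Enumeration {
     val UPSERT, INSERT = Value
   }

   def shouldCombine(operation: WriteOperationType.Value,
                     insertDropDups: Boolean,
                     combineBeforeUpsert: Boolean): Boolean =
     insertDropDups ||
       (operation == WriteOperationType.UPSERT && combineBeforeUpsert)

   // The linked error would only fire when shouldCombine is true and no
   // precombine field is configured.
   def validatePrecombine(combine: Boolean, precombineField: Option[String]): Unit =
     if (combine && precombineField.isEmpty)
       throw new IllegalArgumentException(
         "Precombine field is required when combining records before upsert")
   ```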



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
