bvaradar commented on code in PR #7998:
URL: https://github.com/apache/hudi/pull/7998#discussion_r1152767014


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -1063,7 +1063,9 @@ object HoodieSparkSqlWriter {
     val recordType = config.getRecordMerger.getRecordType
 
     val shouldCombine = parameters(INSERT_DROP_DUPS.key()).toBoolean ||
-      operation.equals(WriteOperationType.UPSERT) ||
+      (operation.equals(WriteOperationType.UPSERT) &&
+        parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key(),
+          HoodieWriteConfig.COMBINE_BEFORE_UPSERT.defaultValue()).toBoolean) ||

Review Comment:
   @danny0405 and I synced on this, and we think there is a valid use case. For 
example: consider an upstream Hudi table A and a Hudi table B derived from 
table A. A nightly job scans all records from table A (which are guaranteed to 
be unique) and upserts them into table B. The upsert is needed because table B 
is not a log table; it is effectively a daily snapshot of table A. In this 
case, we need to upsert, but we can take advantage of table A's uniqueness to 
skip pre-combining.
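
   To make the intended behavior concrete, here is a minimal, dependency-free 
sketch of the condition in the hunk above. The object name, the literal key 
strings, and the "true" default are assumptions for illustration (the real code 
reads them from `INSERT_DROP_DUPS` and `HoodieWriteConfig.COMBINE_BEFORE_UPSERT`), 
and the terms after the trailing `||` in the hunk are not modeled.

```scala
// Hypothetical stand-in for the shouldCombine expression in the diff.
// Key strings and defaults are assumptions, not taken from the Hudi sources.
object ShouldCombineSketch {
  val InsertDropDupsKey = "hoodie.datasource.write.insert.drop.duplicates"
  val CombineBeforeUpsertKey = "hoodie.combine.before.upsert"

  def shouldCombine(operation: String, parameters: Map[String, String]): Boolean = {
    // The real code looks this key up directly; a "false" fallback is used here
    // only so the sketch works with an empty map.
    val dropDups = parameters.getOrElse(InsertDropDupsKey, "false").toBoolean
    // Upsert alone no longer forces a combine; the flag must also be enabled.
    val combineOnUpsert = operation == "upsert" &&
      parameters.getOrElse(CombineBeforeUpsertKey, "true").toBoolean
    dropDups || combineOnUpsert
  }
}
```

   With this shape, the nightly A-to-B job would pass the combine-before-upsert 
flag as "false" and skip the pre-combine step, while an upsert that does not 
set the flag keeps today's behavior.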



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala:
##########
@@ -960,6 +960,86 @@ class TestHoodieSparkSqlWriter {
     assert(spark.read.format("hudi").load(tempBasePath).where("age >= 2000").count() == 10)
   }
 
+  /**
+   * Test upsert for CoW table without precombine field and combine before upsert disabled.
+   */
+  @Test
+  def testUpsertWithoutPrecombineFieldAndCombineBeforeUpsertDisabled(): Unit = {
+    val options = Map(DataSourceWriteOptions.TABLE_TYPE.key -> HoodieTableType.COPY_ON_WRITE.name(),

Review Comment:
   @kazdy: Can you also cover the MOR case? Even for MOR, we should let upsert 
skip pre-combine if the user expects the input batch to be unique and wants to 
skip this step.
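
   One possible way to cover both table types is to build the same write 
options once per table type and run the upsert assertions against each. The 
sketch below only constructs the option maps; the object and method names are 
hypothetical, and the literal key strings are written out on the assumption 
that they match the constants used in the test.

```scala
// Hypothetical helper: identical upsert options for each table type,
// with combine-before-upsert disabled. Names and key strings are assumptions.
object UpsertOptionsSketch {
  val TableTypes: Seq[String] = Seq("COPY_ON_WRITE", "MERGE_ON_READ")

  def optionsFor(tableType: String): Map[String, String] = Map(
    "hoodie.datasource.write.table.type" -> tableType,
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.combine.before.upsert" -> "false"
  )

  // One option map per table type, ready to be fed to a shared test body.
  val allCases: Seq[Map[String, String]] = TableTypes.map(optionsFor)
}
```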


