bvaradar commented on code in PR #7998:
URL: https://github.com/apache/hudi/pull/7998#discussion_r1152767014
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -1063,7 +1063,9 @@ object HoodieSparkSqlWriter {
val recordType = config.getRecordMerger.getRecordType
val shouldCombine = parameters(INSERT_DROP_DUPS.key()).toBoolean ||
- operation.equals(WriteOperationType.UPSERT) ||
+ (operation.equals(WriteOperationType.UPSERT) &&
+ parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key(),
+ HoodieWriteConfig.COMBINE_BEFORE_UPSERT.defaultValue()).toBoolean) ||
Review Comment:
@danny0405 and I synced on this, and we think there is a valid use case. For
example: consider an upstream Hudi table A and a Hudi table B derived from it.
A nightly job scans all records from table A (which are guaranteed to be
unique) and upserts them into table B. The upsert is needed because table B is
not a log table; it is effectively a daily snapshot of table A. In this case we
need to upsert, but we can rely on table A's uniqueness to skip pre-combining.
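
To make the intended behaviour concrete, here is a minimal, self-contained sketch of the condition in the diff. It is a simplified model, not the code in `HoodieSparkSqlWriter`: the object and method names are hypothetical, the string keys and defaults are assumed to match `INSERT_DROP_DUPS` and `HoodieWriteConfig.COMBINE_BEFORE_UPSERT`, and whatever follows the trailing `||` in the real expression is left out.

```scala
// Simplified model of the shouldCombine condition from the diff.
// Assumptions: key strings and defaults below mirror INSERT_DROP_DUPS and
// COMBINE_BEFORE_UPSERT; the operation is modelled as a plain string;
// any terms after the trailing `||` in the real code are omitted.
object ShouldCombineSketch {
  val InsertDropDupsKey = "hoodie.datasource.write.insert.drop.duplicates"
  val CombineBeforeUpsertKey = "hoodie.combine.before.upsert"

  def shouldCombine(operation: String, parameters: Map[String, String]): Boolean = {
    // The real code reads this key without a fallback; a default is used here
    // only so the sketch works on an empty map.
    val dropDups = parameters.getOrElse(InsertDropDupsKey, "false").toBoolean
    val combineBeforeUpsert = parameters.getOrElse(CombineBeforeUpsertKey, "true").toBoolean
    // Upsert alone no longer forces pre-combine; the user can opt out.
    dropDups || (operation == "upsert" && combineBeforeUpsert)
  }
}
```

For the nightly snapshot job described above, the job would set the combine-before-upsert option to `false`, and this condition would then evaluate to `false` for the upsert.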
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala:
##########
@@ -960,6 +960,86 @@ class TestHoodieSparkSqlWriter {
assert(spark.read.format("hudi").load(tempBasePath).where("age >= 2000").count() == 10)
}
+ /**
+ * Test upsert for CoW table without precombine field and combine before upsert disabled.
+ */
+ @Test
+ def testUpsertWithoutPrecombineFieldAndCombineBeforeUpsertDisabled(): Unit = {
+ val options = Map(DataSourceWriteOptions.TABLE_TYPE.key -> HoodieTableType.COPY_ON_WRITE.name(),
Review Comment:
@kazdy: Can you also cover the MOR case? Even for MOR, upsert should skip
pre-combine if the user expects the input batch to be unique and wants to skip
this step.
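
One possible way to cover both table types is to build the writer options from a shared helper and run the same assertions for each. The sketch below only builds the option maps; the helper and object names are hypothetical, and the string keys are assumed to correspond to `DataSourceWriteOptions.TABLE_TYPE`, `DataSourceWriteOptions.OPERATION` and `HoodieWriteConfig.COMBINE_BEFORE_UPSERT`.

```scala
// Hypothetical helper, not part of the PR: builds upsert options with
// pre-combine disabled for a given table type. Key strings are assumptions.
object MorOptionsSketch {
  def upsertWithoutCombine(tableType: String): Map[String, String] = Map(
    "hoodie.datasource.write.table.type" -> tableType,
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.combine.before.upsert" -> "false"
  )

  // One option set per table type, so a test can iterate over both.
  val tableTypes: Seq[String] = Seq("COPY_ON_WRITE", "MERGE_ON_READ")
  val optionSets: Seq[Map[String, String]] = tableTypes.map(upsertWithoutCombine)
}
```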