kazdy commented on code in PR #7998:
URL: https://github.com/apache/hudi/pull/7998#discussion_r1118447028
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -1063,7 +1063,9 @@ object HoodieSparkSqlWriter {
val recordType = config.getRecordMerger.getRecordType
val shouldCombine = parameters(INSERT_DROP_DUPS.key()).toBoolean ||
- operation.equals(WriteOperationType.UPSERT) ||
+ (operation.equals(WriteOperationType.UPSERT) &&
+ parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key(),
+ HoodieWriteConfig.COMBINE_BEFORE_UPSERT.defaultValue()).toBoolean) ||
Review Comment:
Does this mean that users should never set COMBINE_BEFORE_UPSERT for a CoW
table? I thought this config is something that users are allowed to disable
deliberately; at least I was surprised that it does not work as described in
the docs. If combine-before-upsert is always required for CoW, how does Flink
deduplicate records for tables without a precombine field?
My thinking was:
1. user creates a table without a precombine field
2. wants to do an update
3. can't, because they get an exception about the missing precombine field
To allow this flow (a usage sketch follows the list):
1. user creates a table with no precombine field
2. Hudi internally knows no precombine field was defined and disables
COMBINE_BEFORE_UPSERT
3. Hudi does not try to precombine on upsert; it's the user's responsibility
to deduplicate the incoming dataset
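For illustration, a minimal sketch of how a user could opt out once this
change lands, assuming the patched shouldCombine condition above. The config
keys are the real Hudi keys; the table name, paths, and record key field are
placeholders:
```scala
// Minimal usage sketch (assumed flow, not from the PR): deliberately disable
// the pre-combine step on upsert for a table that has no precombine field.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig

val spark = SparkSession.builder().appName("no-precombine-upsert").getOrCreate()
// the user has already deduplicated this dataset themselves
val updates = spark.read.parquet("/tmp/updates")

updates.write.format("hudi")
  .option(DataSourceWriteOptions.OPERATION.key(), "upsert")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "id")
  // opt out of combine-before-upsert; honored by the patched shouldCombine
  .option(HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key(), "false")
  .option(HoodieWriteConfig.TBL_NAME.key(), "no_precombine_tbl")
  .mode(SaveMode.Append)
  .save("/tmp/no_precombine_tbl")
```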
But for this to work, this change would need to be done first.
I thought this was a clean way to do it, because the error is thrown from here:
https://github.com/apache/hudi/blob/cde8d4ffa1cae261731d94c2a0117ece6473a882/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L1096-L1098
after the condition on the shouldCombine val.
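Roughly, the guarded lookup looks like the sketch below, a hedged paraphrase
with assumed names rather than the verbatim Hudi code: the precombine value is
only extracted when shouldCombine is true, so gating shouldCombine on
COMBINE_BEFORE_UPSERT also avoids the missing-precombine-field exception.
```scala
// Hedged paraphrase of the linked guard (names assumed, not verbatim Hudi code).
import org.apache.avro.generic.GenericRecord
import org.apache.hudi.avro.HoodieAvroUtils

def orderingValue(avroRecord: GenericRecord,
                  precombineField: String,
                  shouldCombine: Boolean): Option[AnyRef] =
  if (shouldCombine) {
    // getNestedFieldVal throws when the record has no such field, which is
    // the exception users hit on tables created without a precombine field
    Some(HoodieAvroUtils.getNestedFieldVal(avroRecord, precombineField, false, false))
  } else {
    None // skipping the lookup entirely avoids the error
  }
```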
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]