jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1249618007
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -348,7 +387,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 * expressions to the ExpressionPayload#getInsertValue.
 */
 private def executeUpsert(sourceDF: DataFrame, parameters: Map[String, String]): Unit = {
-  val operation = if (StringUtils.isNullOrEmpty(parameters.getOrElse(PRECOMBINE_FIELD.key, ""))) {
+  val operation = if (StringUtils.isNullOrEmpty(parameters.getOrElse(PRECOMBINE_FIELD.key, "")) && updatingActions.isEmpty) {
Review Comment:
Previously, if the precombine field was not set, we simply performed an insert operation. Now that we do the join and fetch the meta columns for the existing records, the resulting behavior was very strange. If you look at the last test in TestMergeIntoTable2.scala, "Test only insert for source table in dup key without preCombineField", we would end up with 5 records in the final result. The source record with id = 1 matched both existing id = 1 records in the table, so the DataFrame sent to the Spark SQL writer contained two copies of it plus one record with id = 3, and everything was inserted. The previous behavior was to insert only one additional copy. Neither behavior makes sense: first, there are two matching records, and second, the clause is "update set *", not "insert set *". I think the new behavior makes more sense because we now perform an update for matching records.
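
To make the scenario concrete, here is a hypothetical, Hudi-free Scala sketch (the `Rec` case class, `MergeSketch` object, and the in-memory collections are all invented for illustration, not Hudi code) contrasting the two outcomes, assuming the table already holds two copies of id = 1 and that an update rewrites every matched copy:

```scala
// Standalone sketch of the two MERGE INTO behaviors described above.
object MergeSketch {
  // Hypothetical record type: key column plus a payload value.
  case class Rec(id: Int, value: String)

  // Table with a duplicate key and no preCombineField, as in the test.
  val target = Seq(Rec(1, "old-a"), Rec(1, "old-b"))
  // Source: one record matching id = 1, one new record with id = 3.
  val source = Seq(Rec(1, "new-1"), Rec(3, "new-3"))

  // Broken behavior: join, then insert every joined row. The id = 1 source
  // row matches both target copies, producing two copies of it, so the
  // final table has 2 (existing) + 2 (joined id = 1) + 1 (id = 3) = 5 rows.
  def insertOnlyResult: Seq[Rec] = target ++ source.flatMap { s =>
    val matches = target.filter(_.id == s.id)
    if (matches.isEmpty) Seq(s) else matches.map(_ => s)
  }

  // Fixed behavior: matched target rows are updated in place, unmatched
  // source rows are inserted, leaving 2 (updated id = 1) + 1 (id = 3) = 3.
  def upsertResult: Seq[Rec] = {
    val updated  = target.map(t => source.find(_.id == t.id).getOrElse(t))
    val inserted = source.filterNot(s => target.exists(_.id == s.id))
    updated ++ inserted
  }

  def main(args: Array[String]): Unit = {
    println(s"insert-only: ${insertOnlyResult.size}, upsert: ${upsertResult.size}")
  }
}
```

This is only a model of the counts the comment describes; the real code paths go through ExpressionPayload and the Spark SQL writer.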
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]